In order to create service levels you need a role with permissions to modify and delete metric rules.
Key concepts to create SLIs and SLOs
Keep in mind these concepts when defining SLIs and SLOs.
The SLI entity
In the New Relic ecosystem, SLIs and SLOs are linked to entities, which are all the elements in your stack that report data to us, or that generate data that we have access to. The entity that an SLI is related to determines where the SLI/SLO results show. And the entity's tags help filter down SLI results on the service levels view.
You can define SLIs on any NRDB event that is reported to New Relic, and therefore you can also base SLIs on custom events. Most custom events are not related to a single New Relic entity, but provide higher level business and user experience insights. In this case, you can still relate the SLI to a specific entity or to a workload.
SLI queries
SLIs are defined as the percentage of good responses out of the total number of valid requests. Most often you’ll set up your SLIs by defining the valid and good pieces:
A valid request is any request that you want to count as meaningful for your SLIs (for example, all transactions related to an endpoint that weren’t initiated by a health check).
A good response is any response that you consider to provide a good output for the end-user or client service (for example, the service responded in less than 2 seconds, providing a good navigation experience for the end user).
Alternatively, you can define what you consider to be the bad responses instead:
A bad response is any response that you consider to provide a bad output (for example, the service responded with a server error, causing the client to fail its flow). New Relic will automatically derive the count of good responses as valid - bad.
Request-based SLOs are based on an SLI defined as the ratio of the number of good requests to the total number of requests. A request-based SLO is met when that ratio meets or exceeds the goal for the compliance period.
Suggested SLIs
In this section you’ll find some SLIs that are typically used to measure the performance of services and browser applications.
SLIs for APM services instrumented with the New Relic agent
Based on Transaction events, these SLIs are the most common for request-driven services:
Service success is the ratio of the number of successful responses to the number of all requests. This effectively is an error rate, but you can filter it down, for example removing expected errors.
Valid events fields
FROM: Transaction
WHERE: entityGuid = '{entityGuid}'
Where {entityGuid} is the service's GUID.
Bad events fields
FROM: TransactionError
WHERE: entityGuid = '{entityGuid}' AND error.expected IS FALSE
Where {entityGuid} is the service's GUID.
A latency SLI measures the proportion of valid requests that were served faster than the threshold established as a good experience.
In order to determine that duration threshold, check how the service has been performing in the past weeks, and use that result as a realistic and achievable baseline. Afterwards, you can iterate on the SLI threshold, and align it with a more ambitious performance.
To select an appropriate value for the duration condition, one typical practice is to select the 95 percentile duration of the responses for the last 7 or 15 days. Find this duration threshold using the query builder, and use it to determine what you consider to be good events for your SLI:
select percentile(duration, 95) from Transaction where entityGuid = '{entityGuid}' since 7 days ago limit max
Valid events fields
FROM: Transaction
WHERE: entityGuid = '{entityGuid}' AND transactionType = 'Web'
Where {entityGuid} is the service's GUID.
Good events fields
FROM: Transaction
WHERE: entityGuid = '{entityGuid}' AND transactionType = 'Web' AND duration < {duration}
Where {entityGuid} is the service's GUID.
Where {duration} is the response time that you consider provides a good experience for your client service or end-user, in seconds.
SLIs for APM services instrumented with OpenTelemetry
Based on OpenTelemetry spans, these SLIs are the most common for request-driven services:
Service success is the ratio of the number of successful responses to the number of all requests. This effectively is an error rate, but you can filter it down, for example removing expected errors.
Valid events fields
FROM: Span
WHERE: entity.guid = '{entityGuid}' AND (span.kind IN ('server', 'consumer') OR kind IN ('server', 'consumer'))
Where {entityGuid} is the service's GUID.
Bad events fields
FROM: Span
WHERE: entity.guid = '{entityGuid}' AND (span.kind IN ('server', 'consumer') OR kind IN ('server', 'consumer')) AND otel.status_code = 'ERROR'
Where {entityGuid} is the service's GUID.
A latency SLI measures the proportion of valid requests that were served faster than the threshold established as a good experience.
In order to determine that duration threshold, check how the service has been performing in the past weeks, and use that result as a realistic and achievable baseline. Afterwards, you can iterate on the SLI threshold, and align it with a more ambitious performance.
To select an appropriate value for the duration condition, one typical practice is to select the 95 percentile duration of the responses for the last 7 or 15 days. Find this duration threshold using the query builder, and use it to determine what you consider to be good events for your SLI:
select percentile(duration.ms, 95) from Span where entityGuid = '{entityGuid}' AND (span.kind IN ('server', 'consumer') OR kind IN ('server', 'consumer')) since 7 days ago limit max
Valid events fields
FROM: Span
WHERE: entity.guid = '{entityGuid}' AND (span.kind IN ('server', 'consumer') OR kind IN ('server', 'consumer'))
Where {entityGuid} is the service's GUID.
Good events fields
FROM: Span
WHERE: entity.guid = '{entityGuid}' AND (span.kind IN ('server', 'consumer') OR kind IN ('server', 'consumer')) AND duration.ms < {duration}
Where {entityGuid} is the service's GUID.
Where {duration} is the response time that you consider provides a good experience for your client service or end-user, in seconds.
SLIs for browser applications
The following SLIs are based on Google's Browser Core Web Vitals.
It's the proportion of page views that are served without errors.
Valid events fields
FROM: PageView
WHERE: entityGuid = '{entityGuid}'
Where {entityGuid} is the service's GUID.
Bad events fields
FROM: JavaScriptError
WHERE: entityGuid = '{entityGuid}' AND firstErrorInSession IS true
Where {entityGuid} is the browser app GUID.
It’s the proportion of valid page views where the largest content element visible in the viewport was rendered faster than the threshold considered to correspond to a good experience.
Valid events fields
FROM: PageViewTiming
WHERE: entityGuid = '{entityGuid}' AND largestContentfulPaint IS NOT NULL
Where {entityGuid} is the service's GUID.
Good events fields
FROM: PageViewTiming
WHERE: entityGuid = '{entityGuid}' AND largestContentfulPaint < '{largestContentfulPaint}'
Where {entityGuid} is the browser app GUID.
Where {largestContentfulPaint} is the amount of time (in milliseconds) to render the largest content element visible in the viewport that you consider provides a good experience for your end user. A frequent standard is 4000 ms.
To determine a realistic number to use for {largestContentfulPaint} in your environment, one typical practice is to select the 95 percentile duration of the responses for the last 7 or 15 days. Find it by using the query builder:
SELECT percentile(largestContentfulPaint, 95) FROM PageViewTiming WHERE entityGuid = '{entityGuid}' since 7 days ago limit max
It’s the proportion of page views where the time between a user's first interacion with the page and the time when the browser responds to that interaction is less than a certain threshold.
Valid events fields
FROM: PageViewTiming
WHERE: entityGuid = '{entityGuid}' AND firstInputDelay IS NOT NULL
Where {entityGuid} is the service's GUID.
Good events fields
FROM: PageViewTiming
WHERE: entityGuid = '{entityGuid}' AND firstInputDelay < {firstInputDelay}
Where {entityGuid} is the browser app GUID.
Where {firstInputDelay} is the amount of time (in milliseconds) the browser should respond in to provide a good experience for your end user. A frequent standard is 300 ms.
To determine a realistic number to use for {firstInputDelay} in your environment, one typical practice is to select the 95 percentile duration of the responses for the last 7 or 15 days. Find it by using the query builder:
SELECT percentile(firstInputDelay, 95) FROM PageViewTiming WHERE entityGuid = '{entityGuid}' since 7 days ago limit max facet deviceType
It’s the proportion of page views with a good cumulative layout shift (CLS). CLS is described as the total sum of all individual layout shift scores for every unexpected layout shift that occurs during the entire lifespan of the page. A layout shift occurs any time a visible element changes its position from one rendered frame to the next.
Valid events fields
FROM: PageViewTiming
WHERE: entityGuid = '{entityGuid}' AND cumulativeLayoutShift IS NOT NULL
Where {entityGuid} is the service's GUID.
If you’d like to create separate SLIs to track CLS in desktop and mobile devices separately, add one of these clauses at the end of the field:
and deviceType = 'Mobile'
and deviceType = 'Desktop'
Good events fields
FROM: PageViewTiming
WHERE: entityGuid = '{entityGuid}' AND cumulativeLayoutShift < {cumulativeLayoutShift}
Where {entityGuid} is the browser app GUID.
Where {cumulativeLayoutShift} is a pre-set value. To provide a good user experience, your site should strive to have a CLS score of 0.1 or less. A CLS score of 0.25 or more is considered a poor user experience.
If you’ve decided to create separate SLIs to track CLS in desktop and mobile devices separately when you defined the valid events query, add this clause at the end of the field:
and deviceType = 'Mobile'
and deviceType = 'Desktop'
To determine a realistic number to select for {cumulativeLayoutShift} in your environment, one typical practice is to select the 75th percentile of page loads for the last 7 or 15 days, segmented across mobile and desktop devices. Find it by using the query builder:
SELECT percentile(cumulativeLayoutShift, 95) FROM PageViewTiming WHERE entityGuid = '{entityGuid}' since 7 days ago limit max facet deviceType
Create and edit service levels
You can create SLIs and SLOs from several places on New Relic One:
From the Service levels view on the top menu. You can associate the SLI with any entity across your accounts.
From the Service levels page in any APM Service. The SLI will be associated with that specific APM service. If you use this starting point, New Relic will automatically create the most common service level indicators for this entity type, based on the latest available data.
From the Service levels tab in any workload. You can associate the SLI with any entity in the workload, or the workload itself.
Follow these steps:
In order to define your new SLI, choose one of these two options:
Entity data: Base the SLI on standard data coming from New Relic One agents. This is the most common option. If this is your choice, select the entity (for example, APM service) you want to use.
Custom data: Alternatively, you can base the SLI on your custom NRDB events. Use this option when you can't relate the service level data to a specific entity, or when you want to relate the service level directly to a workload.
In this step, you'll configure the SLI count queries that determine which event is valid, or good, or bad.
If you associate the SLI with an APM service or a browser app, New Relic will suggest some typical SLI and their queries. We'll use the latest data as a baseline for your service level objectives, and you will be able to edit the SLI and SLO afterwards.
If you're using a different type of entity, or you want to customize the baseline values provided by New Relic, you can customize the SLI to your needs. For instance, you can use the WHERE clause to filter out health checks.
The account where the data is gathered from matches the account of the entity that the SLI refers to. Please see the section above to know what goes into each field.
On the right you'll see the final queries, and at the bottom you'll get a preview of the count of valid and good/bad events in the last days.
Important
SLI queries support NRDB events and spans, but not dimensional metrics yet.
In this step you'll get a preview of the SLI value, and add one SLO for this SLI: Just select the length of the time window and the percentage target. The chart on the left will help you anticipate whether the target you're setting is feasible or if it's often missed.
Rolling time-window SLOs are supported. With a rolling time-window, the SLO compliance takes into account the last N days. Every minute, the oldest data drops out of the current calculation and new data replaces it.
Select a short name for your SLI that helps you recognize what it’s measuring. Optionally, you may add a description.
Edit SLIs
After you’ve created an SLI, access the UI and click on the ... menu to edit it: