Service Level Management is the practice of standardizing data into a universal language that can be communicated easily to all stakeholders. IT does not usually speak business and business does not usually speak IT, so an observability language barrier must be resolved first in order to improve reliability.
This need for a universal language to articulate reliability is what has re-popularized Service Level Management. Service level management is known best in the practice of Uptime, Performance and Reliability; however, service level management also applies to Customer Experience, Innovation and Growth, and Operational Efficiency.
This implementation guide will teach you the practice of service level management in the context of Uptime, Performance and Reliability.
The required business outcome in the practice of reliability is to reduce the number of business-impacting incidents, their duration, and the number of people involved in those incidents.
- Reduce the number of business-disrupting incidents
- Reduce Mean-Time-to-Resolve (MTTR)
- Reduce average people engaged (FTEs) per severe incident
The required operational outcome of Service Level Management within a reliability practice is to communicate digital product health successfully. Operational success is measured by what percentage of critical product applications are covered by standard service levels and percentage of adoption by primary stakeholders. This is achieved by staying focused on what is important to the stakeholders, standardization, ensuring simplicity, a sprinkle of consulting, and proving the effectiveness of service level management.
The reality is that your service is a "digital product" and that product is only as good as how well it is received by your users. Your technology, as complex as it can be, is a closed system mostly unseen by the internet save your external APIs. The totality of your digital product health is the ability of that product to respond to requests, render content, and process data.
The most critical questions for assessing digital product (service) health are:
- How fast and successfully can we deliver responses?
- Can clients connect to our service?
- Can our clients render content fast and successfully?
- Can we process data fast and successfully?
Service health is not the same as diagnostic data. Service health is not exclusively infrastructure data, nor is it exclusively end-user client performance data. Interestingly, many look first to real user monitoring (RUM) or end-to-end transaction data (distributed tracing) to identify health but are left with many more questions.
It is common to debate reactive and proactive practices when establishing initial health data. For example: "If I know the hardware performance of our infrastructure, I can predict failure." That statement was true when monolithic architecture was dominant. It is not entirely true today, because in distributed systems there is no linear (1:1) relationship between hardware performance and the output performance of the applications that sit on that hardware.
In order to truly be proactive you must first establish real input and output datapoints. You must first know what to measure before you can react. You must first learn how to react before you can pro-act (be proactive). Keep it simple and build your skills incrementally. This guide will show you the fastest path to proactive behaviors.
The reality is that starting at your application layer, closest to the customer but before the client device, is the fastest path to observing health data from the customer's perspective.
You will learn to establish and measure service level objectives for output performance and input performance in this guide as defined below.
Client performance is covered in our Customer Experience - Quality Foundation implementation guide.
Data quality is not covered in this guide because each use case varies greatly depending on the data inputs, outputs, and desired results.
It is strongly recommended that you first accomplish Output Performance and Input Performance before proceeding to Client Performance. Output and input SLOs are very easy to create, and there is a much greater return on your investment in having Output and Input health data first. In addition to being easy to establish, input and output datapoints will provide you with a remediation path much sooner in your reliability journey.
- Achieve basic skills in New Relic One Dashboards and NRQL
- Complete the Service Level Management Free Online Course
- Review the Service Level Management UI Product
- Identify your service
- Identify your service boundary
- Establish your baseline
- Create your service level
The following is assumed:
- Your primary applications are instrumented with New Relic APM agents.
- Your application names follow a familiar naming convention as outlined in our Service Characterization Use Case Guide.
- You are familiar with how to find your application in the New Relic Explorer.
Find your application (APM entity type) and select it. You should see the overview screen below.
Do not yet click Service Levels.
The goal here is to ensure you are measuring the output of your service, first. While dependencies of that application each play a part in response times and success rates, the final and total response time and success is easily measured at the point where the request is received and responded to.
In the screenshot below you are responsible for all applications that support order processing. You selected #2 (Order-Composer) to start, clicked Service maps, and discovered that Order-Composer is really a dependency; therefore, you will need to select #1 (Order-Processing) in order to establish a true health service level.
Your team may only be accountable for the dependency, Order-Composer. If that's the case, then your own service level on Order-Composer is perfectly acceptable for your own self-monitoring of performance. Be sure to tag your own non-customer-facing service levels as customer-facing:false to allow for better filtering in health reports. Also, consider collaborating with the team that owns the customer-facing endpoint (#1 Order-Processing) in your observability journey in order to establish true output performance, an input connectivity service level, and client service levels.
Establishing a baseline is a critical step to accelerating adoption and implementation of service levels. It's more challenging to determine what the design specifications are or should have been for services. Establishing a baseline allows you to measure the current performance of a service and then, through the service level reports, you will know if you are hitting that baseline or degrading.
You can create a baseline for virtually any dataset; however, there are different formulas and recommendations for different use cases. For example, you should use the average for some datasets, percentiles for others, and max for others.
When starting service levels you should start with output performance of your applications. For this we use response times (Latency) and percentage of non-errors (Success).
You do not need much history to establish a baseline. You are establishing a reliability health metric, and seasonality and peak usage are not a handicap for good performance. Also, the more history you include in your measurement, the more likely you are to include different codebases from earlier releases. Previous deployments, no matter how small, could skew your results.
The recommended history is 1-2 weeks of performance data to establish a fair baseline.
Below is an example NRQL query that represents the recommended target for a 7-day service level objective for latency.
FROM Transaction SELECT percentile(duration, 95) AS 'Latency Baseline SLI' WHERE appName='Order-Processing' SINCE 1 WEEK AGO
For a success (error-free) baseline, try the following query. Be sure to replace Order-Processing in appName='Order-Processing' with your own application name.
FROM Transaction SELECT percentage(count(*), WHERE error is false) AS 'Success Baseline SLI' SINCE 1 WEEK AGO WHERE appName='Order-Processing'
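To make the two baseline queries concrete, here is a minimal Python sketch of the same arithmetic on a small, made-up sample. The sample data and function names are hypothetical, and the nearest-rank percentile is an assumption chosen for illustration; NRQL's percentile() may use an approximation and return slightly different values on real data.

```python
def percentile_nearest_rank(values, pct):
    """Return the smallest sample with at least pct% of samples at or below it.

    pct is an integer percentage such as 95.
    """
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def success_percentage(transactions):
    """Percentage of transactions that did not record an error."""
    if not transactions:
        raise ValueError("need at least one transaction")
    ok = sum(1 for t in transactions if not t["error"])
    return 100.0 * ok / len(transactions)


# Hypothetical week of data: 20 transactions, durations 0.1s to 2.0s,
# with every tenth transaction marked as an error.
transactions = [
    {"duration": i / 10, "error": i % 10 == 0} for i in range(1, 21)
]
durations = [t["duration"] for t in transactions]

latency_baseline = percentile_nearest_rank(durations, 95)
success_baseline = success_percentage(transactions)
print(f"Latency baseline (p95): {latency_baseline}s")
print(f"Success baseline: {success_baseline}%")
```

The p95 value is the latency that 95% of requests met or beat during the baseline window, which is why it works as a starting threshold for "good" events.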
Great news! The New Relic platform will calculate recommended APM and browser baselines for you.
NOTE: If you do not see the Add a service level button, you will need to check with your administrator regarding your permissions.
The "Identifying your service" section above shows you how to find your application APM data. You'll see #2 in the screenshot in that same section, called "Service Levels." Find your application APM data and click Service levels. You should see the view below.
Click Add baseline service level objectives and almost instantly you will have both your Latency SLI and Success SLI and their respective objectives created for you.
You can view and change all the settings by clicking the three dot icon in the upper-right corner of each SLO scorecard.
NOTE: It will take approximately 10 minutes for data to populate the SLO scorecard. This is because we use Events-to-Metrics for data longevity and query performance. It takes a moment for the conversion to take place and retroactively populate the data.
Having two output service levels for each application and capability can quickly become overwhelming. To simplify this, you can combine these two indicators and have just one output service level.
NOTE: Service levels that use different datasets should not be combined. The easy way to understand this is to combine only service levels that use the same dataset and the same count of valid transactions. For example, Latency and Success both use Transactions and are compared against the same quantity of valid transactions. Synthetic checks use a different dataset; therefore, they should not be combined with Transaction-based service levels.
To combine Latency and Success, select the Latency service level and add AND error is false in the second query box that represents "good" events. You will also want to change the name of the SLO to simply "Output Performance SLO" and the description to include "and error-free." See screenshot below.
You can now delete the other service level using the three-dot icon.
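The effect of combining the two indicators can be sketched in Python. This is an illustration under stated assumptions, not how the platform computes it internally: the events, the helper name, and the 1.78-second threshold are hypothetical. It shows that a combined indicator counts an event as "good" only when it is both fast and error-free, so its score can never be higher than either indicator on its own.

```python
LATENCY_THRESHOLD = 1.78  # seconds; hypothetical baseline value


def sli_compliance(events, is_good):
    """Percentage of valid events for which is_good(event) is true."""
    if not events:
        raise ValueError("need at least one valid event")
    good = sum(1 for e in events if is_good(e))
    return 100.0 * good / len(events)


# Hypothetical valid events: ten transactions from the same dataset.
events = [
    {"duration": 0.4, "error": False},
    {"duration": 2.1, "error": False},  # slow
    {"duration": 0.6, "error": True},   # failed
    {"duration": 0.9, "error": False},
    {"duration": 2.5, "error": True},   # slow and failed
    {"duration": 0.3, "error": False},
    {"duration": 1.2, "error": False},
    {"duration": 0.8, "error": False},
    {"duration": 1.5, "error": False},
    {"duration": 0.7, "error": False},
]

latency = sli_compliance(events, lambda e: e["duration"] < LATENCY_THRESHOLD)
success = sli_compliance(events, lambda e: not e["error"])
combined = sli_compliance(
    events,
    lambda e: e["duration"] < LATENCY_THRESHOLD and not e["error"],
)
print(latency, success, combined)
```

Because both indicators share the same valid events, the combined score is a single, stricter number. This is also why indicators built on different datasets should not be merged: their denominators would not match.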
Create your synthetic check.
Create your service level indicator.
The most common input performance service level is often referred to as "Connectivity" or "Uptime." This is a simple check against a health API endpoint or simply loading a URL. Both of these can be done easily by using our synthetic monitoring service. Please refer to our add simple browser monitor and add scripted api test documentation to begin collecting data.
You should now have data after completing step 1.
Now you will use the Service Level Management service to create an input indicator and objective.
Use the Explorer menu to select Service levels, and then click + Add a service level indicator.
NOTE: If you do not see the Add a service level button, you will need to check with your administrator regarding your permissions.
Next, filter your entity types to Synthetic monitors. See screenshot below.
Now find your synthetic monitor in the list and click it. This will enable the Continue button in the left panel. Click it.
Use the screenshot below as a visual reference and follow the next steps:
- Copy the entity GUID to your clipboard using the clipboard icon.
- Select SyntheticCheck in the first FROM pulldown menu.
- In the first query box, type entityGUID='' and paste your clipboard into the single quotes. This will identify all valid tests.
- Type result='SUCCESS' in the second query box. This will define good by selecting only successful tests.
- Complete the SLO threshold with a recommended 99.99% target and complete your description.
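Before committing to a 99.99% target, it can help to work out how many failed checks that target actually tolerates. The following Python sketch does the arithmetic for a 7-day window. The check intervals, the single monitoring location, and the function names are assumptions for illustration only.

```python
import math
from fractions import Fraction


def checks_in_period(days, interval_minutes):
    """Number of checks run from one location over the period."""
    return days * 24 * 60 // interval_minutes


def allowed_failures(total_checks, target_pct):
    """Largest number of failed checks that still meets the target.

    target_pct is passed as a string (for example "99.99") so the
    percentage is represented exactly rather than as a binary float.
    """
    budget = (1 - Fraction(target_pct) / 100) * total_checks
    return math.floor(budget)


one_minute = checks_in_period(days=7, interval_minutes=1)
five_minute = checks_in_period(days=7, interval_minutes=5)

print(one_minute, allowed_failures(one_minute, "99.99"))
print(five_minute, allowed_failures(five_minute, "99.99"))
```

Under these assumptions, a check every minute leaves room for one failed check in the week, while a check every five minutes leaves room for none. If a single failure would breach the objective, consider whether the check frequency and the target fit together.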
This is where you will really accelerate adoption of service levels!
You do not need to have intimate knowledge of an application or service in order to complete this task. You simply need to know where the consumer-facing API (service boundary) is, and follow the steps below.
This is a major step in your observability maturity journey. Having service levels on critical business capabilities, like login or authorize payment, will rapidly close the language barrier between IT and business. Service level scores on capabilities also provide you with a more precise remediation path when their service levels begin to degrade. For example, if the login service level begins to degrade, you will know to look at identity management dependencies and workflows starting at the consumer-facing API.
NOTE: In this task you are building on the skills you learned in the "Establishing an output SLI" section.
- Assess application capabilities.
- Baseline a capability.
- Create your capability service level.
Identify the service boundary application as described in the "Establishing an output SLI" section above.
Now run the following NRQL query to identify baselines on the most frequently used transactions. Be sure to replace Order-Processing in appName='Order-Processing' with the application name you identified.
FROM Transaction SELECT count(*), percentile(duration, 95) WHERE appName='Order-Processing' FACET name SINCE 1 WEEK AGO
You should see something similar to the screenshot below.
You'll see the first transaction (#3) states it has something to do with "purchase." You can now create a "purchase" capability service level.
NOTE: Even if you are not sure that this transaction represents the purchase capability, this exercise makes a great example to show the application team and your leadership the value of capability service levels. Remember, your goal here is to start a conversation with the stakeholders by showing the art-of-the-possible.
Add WHERE name='Controller/Sinatra//purchase' to the end of your query, replacing Controller/Sinatra//purchase with your transaction name. Run the query to make sure it works. You should now see only the one transaction in your result. Copy this query and the DURATION (95%) result into a notepad. You'll need both in a moment.
Create a new service level in the platform. Starting a new service level is described in Establishing an input performance SLI.
In this case you want to find your application (APM entity type) in the list so you can retain the metadata (tags) through the entity GUID. Instead of "Synthetic monitor" as in the section above, select "APM" in the entity filter pulldown.
Select the "Latency" guided workflow so the good and valid queries are auto-populated for you.
Use your notepad to copy just the name='Controller/Sinatra//purchase' condition (with your own transaction name).
Add this condition to both queries in service levels, preceded with an AND, as underlined in the screenshot below.
Simply adjust the duration < 1.78 portion of the second query to match the DURATION (95%) result in your original baseline copied to your notepad.
Add AND error is false to the same query.
Congratulations! You now have a combined Latency and Success service level for a capability in the language of the business.
Proceed to name this service level, update the description, and save the service level.
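If you plan to create several capability service levels, the conditions from the steps above can be generated rather than typed by hand. The Python sketch below is a hypothetical helper, not part of any New Relic tooling. It only builds the text to append to the valid and good queries, and it assumes the guided workflow has already populated the rest of each query for your entity.

```python
def capability_conditions(transaction_name, p95_seconds):
    """Build the conditions for one capability service level.

    Returns the text that the valid and good queries should contain
    in addition to whatever the guided workflow pre-populated.
    """
    if "'" in transaction_name:
        # Keep the sketch simple: no escaping of embedded quotes.
        raise ValueError("transaction name must not contain a single quote")
    name_condition = f"name='{transaction_name}'"
    return {
        # Valid events: every request for this capability.
        "valid": f"AND {name_condition}",
        # Good events: requests that were fast enough and error-free.
        "good": (
            f"AND {name_condition} "
            f"AND duration < {p95_seconds} AND error is false"
        ),
    }


# Values taken from the baseline step; substitute your own.
conditions = capability_conditions("Controller/Sinatra//purchase", 1.78)
print(conditions["valid"])
print(conditions["good"])
```

Generating the text from the transaction name and the baseline value keeps the valid and good queries consistent with each other, which matters because both must filter on the same transaction.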
It's recommended to set up a few of these capability service levels and present to the application team and your leadership for feedback.
Alert quality management is another observability maturity practice that pairs well with service level management. The value of viewing alert quality data side-by-side with service level data is that you can see whether alert policies are aligned with real impact or are just creating noise. You will be able to validate good alerts, find missing alerts, and identify noisy alerts.
You can do this by creating a custom dashboard with an SLI compliance query side-by-side with an alerting quality query.
Be sure to check out alerting quality management next.
Improvement of service levels and reliability requires adoption of the practice by all the stakeholders of the service. This includes, but is not limited to, engineering management, product management, and executive management. The primary goal is to quickly demonstrate the power and value of service levels to stakeholders in order to start a meaningful discussion on what really matters to those stakeholders. The steps in this guide will get you those meaningful discussions very quickly.
A proven method, with a high rate of adoption, is to first establish output performance and input performance service levels for one digital product and its critical capabilities. This usually involves one overall output and input service level for each endpoint application (usually one or two), and then approximately 4-7 output performance service levels for assumed critical capabilities measured at the endpoint transaction.
This method deliberately does not survey each stakeholder about what should and shouldn't be measured. Surveys usually result in long wait times, lots of questions, frustration, a lack of demonstrated value, and poor answers. Remember, start with baselines and key transactions as "capabilities."
Freely make assumptions of what these endpoints are and what endpoint transactions make up what capabilities as demonstrated above. Accuracy is not the key at first. What is key to a successful kick-off is demonstrating the ability to easily measure and communicate health. That initial demonstration will show the value in investing more time to refine what is and what isn't measured in primary service levels.
Don't wait. The sooner you provide that demonstration and the more complete that demonstration is, the sooner you will achieve broader adoption and begin the reliability improvement process in collaboration with all the stakeholders!
Once you have established what works (and what doesn't) for your stakeholders, you can then begin to design SLM at scale with automation. You can start learning about automating service level management by studying the New Relic Terraform library.
The next step is to add in customer experience service levels measured at the client browser or mobile device. Again, it's important you first prove value as described in the improvement process above. Remember, observability is a journey, and maturity takes time, practice, and patience.
See Quality Foundation in our Customer Experience section to proceed on your journey.
Also see alerting quality management.