Alert quality management use case implementation guide

Overview

Teams suffer from alert fatigue when they experience high alert volumes and alerts that are not aligned to business impact. As they start to believe that most alerts are false, they may prioritize easy to resolve alerts over others. Also, they may close unresolved incidents so they can stay within their SLA targets.

The result will be slower incident responses, magnified issue scope, and increased severity when true business impacting issues occur.

Alert Quality Management (AQM) focuses on reducing the number of nuisance incidents so that you focus only on alerts with true business impact. This reduces alert fatigue and ensures that you and your team focus your attention on the right places at the right times.

You are a good candidate for AQM if:

You have too many alerts.
You have alerts that stay open for long time periods.
Your alerts are not relevant.
Your customers discover your issues before your monitoring tools do.
You can't see the value of your observability tool(s).

Desired outcome

An alert strategy based on measuring business impact will result in faster response times and greater proactive awareness of critical events. An improved alert signal to noise ratio reduces confusion and improves rapid identification and problem isolation.

AQM's overall goal is to ensure that fewer, more valuable, incidents are created, resulting in:

Increased uptime and availability
Reduced MTTR
Decreased alert volume
The ability to easily identify alerts that are not valuable, so you can either make them valuable or remove them.

The AQM process described in this guide generates the key performance indicators and metrics that you will use to measure progress towards these goals. The metrics are measured in real time, published in a dashboard, and are used to drive a continuous improvement process that identifies and reduces nuisance alerts and increases user engagement in incident investigation.

AQM does not encompass anomaly detection or AIOps, which are designed to detect unknown or unexpected modes of failure. The two practices (AQM and ML/AI) work hand in hand, they are not mutually exclusive.

Key performance indicators

You will use the AQM process to collect and measure the following KPIs:

Incident volume

Incident Count
Accumulated incident time
Mean Time to Close (MTTC)
Percent Under 5 Minutes

User Engagement

Mean Time to Investigate (MTTI)
% of Incidents Investigated

These KPIs will help you to find the noisiest and least valuable alerts so you can improve their value or eliminate them. You will then use the long term metric trends to show real business impact to management and stakeholders. Detailed information on each metric follows.

Incident volume

You should treat incidents (with or without alerts) like a queue of tasks. Just like a queue, the number of alerts should spend time near zero. Each incident should be a trigger for action to resolve the condition. If an alert does not result in action, then you should question the value of the alert condition.

If you see a constant rate of incidents or specific incidents that are "always-on", then you should question why. Are you in a constant state of business impact, or do you simply have a large volume of noise? The alert volume KPIs help you to answer those questions and to measure progress towards a healthy state of high quality alerting.

Incident Count is the number of incidents generated over a period of time. Typically you should compare the current and previous weeks.

Goal: Reduce the number of low value / nuisance incidents. Best practices:

Ensure condition settings are intended to detect real business impact.
Ensure condition settings are detecting abnormal behavior.
Communicate that the incident details "Acknowledge" feature helps measure meaningful and actionable alerts. See Percentage Incident Acknowledge KPI.
Report AQM KPIs to all stakeholders.

Accumulated incident duration is the total sum of minutes that all the incidents accumulated over a period of time. Typically you should compare the current and previous weeks.

Goal: Reduce the total accumulated minutes of incidents. Best practices:

Do not manually close incidents. Manual closure will skew the real duration of incident length.
Eliminate alerts that do not result in any remediation actions from the recipients.
Improve percent investigated and mean-time-to-investigate KPIs by communicating their importance in improving detection and response times.
Report AQM KPIs to all Stakeholders.

Percentage of incidents where the duration of the incident is under five minutes. This can be an indicator of incident flapping.

Goal: Minimize percentage of incidents with short durations Best practices:

Ensure that conditions are detecting legitimate deviations from expected behavior.
See Baselining and Service Level Management.
Ensure that conditions are detecting legitimate deviations that correlate to business impact or impending business impact.

User engagement

You should measure the value of an incident by the amount of attention it receives. Engagement in this context is measured by whether or not an incident has been acknowledged.

The amount of engagement an individual alert receives is a direct measurement of its value. More engagement implies a valuable alert, less (or zero) engagement implies a nuisance alert that should be modified or disabled.

There is a significant difference between measuring the moment of incident awareness vs. acknowledging the moment resolution activity begins. If you are using an integration with New Relic Alerting, be sure that the "acknowledge" event that is sent to New Relic is triggered when resolution activity begins, not when the incident is sent to the external incident management tool. For more information regarding the standard Incident Management process, see "Incident Management Process: 5 Steps to Effective Resolution Posted on August 31, 2020 by OnPage Corporation. -- in reference to ITIL4"

Incidents acknowledged identifies the percentage of incidents that have been engaged with and had their acknowledged property set to true. Typically you should compare the current and previous weeks.

Goal: Increase the percentage of incident engagement.

Best practices:

Educate the DevOps team on when it is appropriate to acknowledge an incident alert.
Gamify alert acknowledgement to drive usage.
Discourage mass acknowledgement exercises.

Prerequisites

Before you begin, if you don't have equivalent experience, complete the New Relic University (NRU) Overview Course. Also, make sure you have a basic understanding of:

NR1 Alert policy and conditions configuration
NR1 incident notification channel webhook configuration
NR1 NRQL
NR1 alerting best practices
NR1 APM & Infrastructure
How to baseline data in order to determine anomalies vs. normal behavior.

Establish current state

As with any continuous improvement process, the first step of AQM is to establish the current state of your KPIs. To do so, perform the following tasks:

Install and configure the incident event webhook
Install the AQM Dashboard
Perform initial AQM orientation and enablement
Accumulate AQM data
Perform second enablement session

Install and configure the incident event webhook

The webhook will create New Relic events for each incident as it proceeds through its lifecycle (open, acknowledge, close). To ensure that the AQM process generates accurate and valuable findings, this webhook must be added as a notification channel to every alert policy.

The AQM process requires incident, not violation data. This is why you will not be using the default NrAiIncident event, which provides violation data only. Instead, you will use this webhook to send the required incident data to New Relic.

To use the webhook, do the following:

Identify your primary production account and each of your accounts that you will be analyzing with the AQM process.
Install the incident event webhook into each account that will participate in the AQM process and configure the webhook to report nrAQMIncident events to your primary production account.
Assign the webhook as a notification channel to every alert policy in each account.

This example shows a webhook notification channel assigned to each alert policy for a New Relic account with multiple sub-accounts.

The webhook, AQM dashboard, and detailed installation instructions can be found in the New Relic OMA resource center on GitHub.

Install the AQM dashboard

The AQM dashboard is the primary asset that drives the AQM process. You need to install the AQM dashboard into the primary production account you identified in the "Install and configure incident event webhook" step you previously performed by doing the following:

Download the dashboard definition JSON file from the New Relic OMA resource center GitHub repo.
Import the definition into your primary production account.

For more details on importing dashboards, see the New Relic Introduction to dashboards documentation

Perform initial AQM orientation and enablement

During this phase, your incident management team(s) and other stakeholders will learn the goals of the AQM process and the scope of their involvement in it.

The most critical portion of this task is educating your team on the importance of acknowledging incident alerts, since that's how the alert's value is determined. In general, instruct them to follow these guidelines:

If you look at an alert and decide to take any sort of further investigative action, acknowledge the alert.
If you typically close an alert without doing anything else, do not acknowledge the alert.
If the incident alert is always on, do not close or acknowledge it. For further details, see Second Enablement Session.

You can use the first session template presentation to communicate this material to your stakeholders.

Accumulate AQM data

The overall process requires at least two weeks of data before it can proceed. During this time, you should periodically check the following items:

Confirm that incident alert event data is accumulating.
Confirm that the webhook is attached to every alert policy.
Ensure that incident responders are following the alert acknowledgement guidelines.

Perform second enablement session

During this phase, you will introduce incident management teams and other stakeholders to the initial AQM data and the ongoing continuous improvement process you'll be following.

The process consists of four activities:

Review AQM Dashboard and KPI Trending: Here you and the stakeholders will look at the AQM KPIs and identify their week over week trends. The team should identify areas where KPIs are not improving and develop strategies to drive improvement.
Identify Achievements, Challenges, and Opportunities: Here you and the stakeholders will map the current state of alert quality to business impact, identifying areas where improvement has resulted in better business outcomes and areas where problems are impacting business outcomes.
Incident Policy Review: Using the AQM dashboard, you and the stakeholders will identify the noisiest incident policies. Once identified, those policies should be evaluated as detailed in step 4 below.
Alert Policy Recommendations: In this step, you and the stakeholders will review the noisiest policies using the following criteria:
- Do the alerts have any business impact?
- Are the policies properly configured?
  - Are they telling us something about the resource that needs to be fixed?
  - Are the policies necessary? Do they have business impact?
  - Are the thresholds set properly?
Technical recommendations: Here, you and the stakeholders will review any technical recommendations, including:
- Are there application / system problems for engineering to review?
- Are there poorly constructed policies that need to be fixed?
- Are there instrumentation gaps?

You can use the second session template presentation to keep this part of the AQM process organized.

Improvement process

This is the ongoing phase of the continuous improvement process where you periodically review your accumulated AQM data and make adjustments as needed to alert policies. You should perform this step once a week until your alert volume is acceptable. You can then perform it less frequently.

During this phase you should:

Report your KPIs each week to upper management to ensure that the stakeholder teams are appropriately prioritizing the work and to show that progress towards the promised business outcomes are being reached.
Record and retain your weekly KPIs over periods of months to years to establish a baseline and to show the rate of improvement.

You should keep in mind that this is a continuous improvment process, you will continue to collect and evaluate the KPIs over long periods of time to ensure you are meeting your AQM goals.

Value realization

Once the AQM process is established, you will see significant reductions in the volume of alerts while reliability and stability remain the same or improve. In addition, you should see that your alerts have a clear and unambiguous business impact. Your AQM KPIs will provide quantifiable proof of these improvements.

Once you are firmly on the path to AQM's goals, consider moving to other use cases within the Uptime, Performance, and Reliability value stream, such as Service Level Management, or Reliability Engineering. You can also move to other observability maturity value streams, such as Customer Experience.

KPI reference

Following are the descriptions of each KPI as well as sample NRQL queries that will extract them from the New Relic platform. These KPIs are also included in the AQM dashboard that can be downloaded from the New Relic OMA resource center GitHub repo.

Incident volume

Incident Count is the number of incidents generated over a period of time. Typically you should compare the current and previous weeks.

NRQL:

FROM nrAQMIncident SELECT count(*) AS 'Incident Count' WHERE current_state='open' AND severity='CRITICAL' SINCE 1 WEEK AGO COMPARE WITH 1 WEEK AGO

Accumulated incident duration is the total sum of minutes that all the incidents accumulated over a period of time. Typically you should compare the current and previous weeks.

NRQL:

FROM nrAQMIncident SELECT sum(duration)/(1000*60) AS 'Incident Minutes' WHERE current_state='closed' AND severity='CRITICAL' SINCE 1 WEEK AGO COMPARE WITH 1 WEEK AGO

Average duration of incidents within the period of time measured.

NRQL:

FROM nrAQMIncident SELECT average(duration/(1000*60)) AS 'Incident MTTC (minutes)' WHERE current_state='closed' AND severity='CRITICAL' SINCE 1 WEEK AGO COMPARE WITH 1 WEEK AGO

Percentage of incidents where the duration of the incident is under five minutes. This can be an indicator of incident flapping.

NRQL:

FROM nrAQMIncident SELECT percentage(count(*), WHERE duration <= 5 * 60 * 100) AS '% Under 5min' WHERE current_state='closed' AND severity='CRITICAL' SINCE 1 WEEK AGO COMPARE WITH 1 WEEK AGO

Incident engagement

Incidents acknowledged identifies the percentage of incidents that have been engaged with and had their acknowledged property set to true. Typically you should compare the current and previous weeks.

NRQL:

FROM nrAQMIncident SELECT filter(count(*), WHERE current_state='acknowledged')/filter(count(*), WHERE current_state='open')*100 AS '% Investigated' WHERE severity='CRITICAL' SINCE 1 WEEK AGO COMPARE WITH 1 WEEK AGO

Mean time to investigate identifies the average time it takes for an incident to be triaged. Typically you should compare the current and previous weeks.

NRQL:

FROM nrAQMIncident SELECT average(duration/(1000*60)) AS 'Incident MTTI (minutes)' WHERE current_state='acknowledged' AND severity='CRITICAL' SINCE 1 WEEK AGO COMPARE WITH 1 WEEK AGO

Additional resources

Want to get your hands dirty before you start implementing this in your account? Check out the alert quality management lab

Alert quality management use case implementation guide

Overview

Desired outcome

Key performance indicators

Incident volume

Incident count KPI

Accumulated incident duration KPI

Mean time to close (MTTC) KPI

Percent under 5 minutes KPI

User engagement

Percentage Acknowledged KPI

Mean time to investigate (MTTI) KPI

Prerequisites

Establish current state

Install and configure the incident event webhook

Install the AQM dashboard

Perform initial AQM orientation and enablement

Accumulate AQM data

Perform second enablement session

Improvement process

Value realization

KPI reference

Incident volume

Incident count KPI

Accumulated incident duration KPI

Mean time to close (MTTC) KPI

Percent under 5 minutes KPI

Incident engagement

Percentage Acknowledged KPI

Mean time to investigate (MTTI) KPI

Additional resources