In software development and operations, it is common to have a group consisting of members you expect to behave approximately the same. For example: for servers using a load balancer, the traffic to the servers may go up or down, but the traffic for all the servers should remain in a fairly tight grouping. See outlier detection in action in this NerdBytes video (2:51 minutes).
The NRQL alert outlier detection feature parses the data returned by your faceted NRQL query and:
Looks for the number of expected groups that you specify
Looks for outliers (values deviating from a group) based on the sensitivity and time range you set
Additionally, for queries that have more than one group, you can choose to be notified when groups start behaving the same.
This visual aid will help you understand the types of situations that will trigger a violation and those that won't.
Note: this feature does not take into account the past behavior of the monitored values; it looks for outliers only in the currently reported data. For an alert type that takes into account past behavior, see Baseline alerting.
Example use cases
These use cases will help you understand when to use the outlier threshold type. Note that the outlier feature requires a NRQL query with a FACET clause.
A load balancer divides web traffic approximately evenly across five different servers. You can set a notification to be sent if any server starts getting significantly more or less traffic than the other servers.
Example query:
SELECT average(cpuPercent) FROM SystemSample WHERE apmApplicationNames = 'MY-APP-NAME' FACET hostname
Application instances behind a load balancer should have similar throughput, error rates, and response times. If an instance is in a bad state, or a load balancer is misconfigured, this will not be the case. Detecting one or two bad app instances using aggregate metrics may be difficult if there is not a significant rise in the overall error rate of the application.
You can set a notification for when an app instance’s throughput, error rate, or response time deviates too far from the rest of the group.
Example query:
SELECT average(duration) FROM Transaction WHERE appName = 'MY-APP-NAME' FACET host
An application is deployed in two different environments, with ten application instances in each. One environment is experimental and gets more errors than the other. But the instances that are in the same environment should get approximately the same number of errors.
You can set a notification for when an instance starts getting more errors than the other instances in the same environment. Also, you can set a notification for when the two environments start to have the same number of errors as each other.
The number of logged-in users for a company is about the same for each of four applications, but varies significantly across the three time zones the company operates in.
You can set a notification for when any application starts getting more or less traffic from a certain time zone than the other applications. Sometimes the traffic from the different time zones is the same, so you would set up the alert condition to not notify you if the time zone groups overlap.
As of March 31, 2022, we're discontinuing support for several capabilities, including NRQL outlier alert conditions. For more details, including how you can easily prepare for this transition, see our Explorers Hub post and our transition guide for alert capabilities.
To create a NRQL alert that uses outlier detection:
When creating a condition, under Select a product, select NRQL.
Here are the rules and logic behind how outlier detection works:
After the condition is created, the query is run once every harvest cycle and the condition is applied. Unlike baseline alerts, outlier detection uses no historical data in its calculation; it's calculated using the currently collected data.
Alerts will attempt to divide the data returned from the query into the number of groups selected during condition creation.
For each group, the approximate average value is calculated. The allowable deviation you have chosen when creating the condition is centered around that average value. If a member of the group is outside the allowed deviation, it produces a violation.
If Trigger when groups overlap has been selected, Alerts detects a convergence of groups. If the condition is looking for two or more groups, and the returned values cannot be separated into that number of distinct groups, then that will produce a violation. This type of “overlap” event is represented on a chart by group bands touching.
Because this feature does not take past behavior into account, data is never considered to "belong" to a certain group. For example, a value that switches places with another value wouldn't trigger a violation. Additionally, an entire group that moves together also wouldn't trigger a violation.
The query must return 500 or fewer unique values. If the query returns more than this number of values, the condition won't be created. If the query starts returning more than this number after the condition has been created, the alert will fail.
When a query returns a set of values, only values that are actually returned are taken into account. If a value is not available for calculation (including if it goes from being collected one harvest cycle to not being collected), it is rendered as a zero and is not considered. In other words, the behavior of unreturned zero values will never trigger violations.
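The rules above can be illustrated with a minimal Python sketch. This is not New Relic's implementation, which these rules describe only in outline. The function names (`split_into_groups`, `evaluate`), the gap-based splitting, and the use of the median as the "approximate average" are all assumptions made for illustration. The sketch takes one harvest cycle's faceted values as a dictionary, so it keeps no history and only considers values that were actually returned.

```python
from statistics import median

MAX_FACETS = 500  # limit on unique values stated in the rules above


def split_into_groups(values, group_count):
    """Assumed grouping strategy: cut the sorted values at the widest gaps."""
    ordered = sorted(values.items(), key=lambda item: item[1])
    gaps = [(ordered[i + 1][1] - ordered[i][1], i) for i in range(len(ordered) - 1)]
    widest = sorted(gaps, reverse=True)[: group_count - 1]
    cut_after = sorted(index for _, index in widest)
    groups, start = [], 0
    for index in cut_after:
        groups.append(dict(ordered[start : index + 1]))
        start = index + 1
    groups.append(dict(ordered[start:]))
    return groups  # ordered from lowest values to highest


def evaluate(values, group_count, deviation, trigger_on_overlap=False):
    """Check one harvest cycle of faceted values; no past data is used."""
    if len(values) > MAX_FACETS:
        raise ValueError("query returned more than 500 unique values")
    if not values:
        return {"outliers": [], "overlap": False}
    groups = split_into_groups(values, group_count)
    # Assumption: the median stands in for the "approximate average".
    centers = [median(group.values()) for group in groups]
    outliers = sorted(
        name
        for group, center in zip(groups, centers)
        for name, value in group.items()
        if abs(value - center) > deviation
    )
    # Bands are center +/- deviation; adjacent bands touching counts as overlap.
    overlap = trigger_on_overlap and any(
        low + deviation >= high - deviation
        for low, high in zip(centers, centers[1:])
    )
    return {"outliers": outliers, "overlap": overlap}


# One group of hosts, one of which drifts away from the rest.
single_group = evaluate(
    {"host-a": 10, "host-b": 11, "host-c": 9, "host-d": 30},
    group_count=1,
    deviation=3,
)

# Two hosts swapping places between two groups produces the same result,
# because membership is recomputed from the current values each cycle.
before_swap = evaluate({"host-a": 10, "host-b": 50}, group_count=2, deviation=3)
after_swap = evaluate({"host-a": 50, "host-b": 10}, group_count=2, deviation=3)

# Two groups that have converged so that their bands touch.
converged = evaluate(
    {"host-a": 10, "host-b": 12, "host-c": 14, "host-d": 15},
    group_count=2,
    deviation=3,
    trigger_on_overlap=True,
)
```

In this sketch, `single_group` reports `host-d` as the only outlier, the two swap evaluations are identical and contain no outliers, and `converged` reports an overlap. The median is used rather than the mean so that a single extreme value does not drag the center of its group toward itself; the real feature may compute its center differently.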