Create infrastructure "host not reporting" condition

Use infrastructure monitoring's Host not reporting condition to notify you when we've stopped receiving data from an infrastructure agent. This feature allows you to dynamically alert on groups of hosts, configure the time window from 5 to 60 minutes, and take full advantage of alert notifications.

Features

You can define conditions based on the sets of hosts most important to you, and configure thresholds appropriate for each filter set. The Host not reporting event triggers when data from the infrastructure agent doesn't reach our collector within the time frame you specify.

Caution

If you filter your Host not reporting condition by tags or labels and then remove a critical tag or label from a targeted host, the system opens a Host not reporting violation, because it treats that host as having lost its connection.

This feature's flexibility allows you to easily customize what to monitor and when to notify selected individuals or teams. In addition, the email notification includes links to help you quickly troubleshoot the situation.

Host not reporting condition	Features
What to monitor	You can use filter sets to select which hosts you want to be monitored with the alert condition. The condition will also automatically apply to any hosts you add in the future that match these filters.
How to notify	Conditions are contained in policies. You can select an existing policy or create a new policy with email notifications from the Infrastructure monitoring UI. If you want to create a new policy with other types of notification channels, use the alerts UI.
When to notify	Email addresses (identified in the policy) will be notified automatically about threshold violations for any host matching the filters you have applied, depending on the policy's incident preferences.
Where to troubleshoot	The link at the top of the email notification takes you to the infrastructure Events page, centered on the time when the host disconnected. Other links in the email lead to further details.
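The "What to monitor" row can be illustrated with a minimal sketch. It assumes a filter set is a simple map of required tag values and that a host is modeled as its tag map; the names `matches`, `targeted_hosts`, and the sample tags are hypothetical and not part of the product.

```python
from typing import Dict, List

Tags = Dict[str, str]  # hypothetical model: a host is just its tag/label map


def matches(filters: Tags, host_tags: Tags) -> bool:
    """Return True when the host carries every tag the filter set requires."""
    return all(host_tags.get(key) == value for key, value in filters.items())


def targeted_hosts(filters: Tags, hosts: Dict[str, Tags]) -> List[str]:
    """Names of hosts currently covered by the condition, in sorted order."""
    return sorted(name for name, tags in hosts.items() if matches(filters, tags))


filters = {"env": "production", "role": "web"}
hosts = {
    "web-1": {"env": "production", "role": "web"},
    "db-1": {"env": "production", "role": "db"},
}
print(targeted_hosts(filters, hosts))  # ['web-1']

# A host added later is covered as soon as its tags match the filters.
hosts["web-2"] = {"env": "production", "role": "web"}
print(targeted_hosts(filters, hosts))  # ['web-1', 'web-2']
```

Because membership is evaluated against the filters rather than a fixed host list, changing a host's tags changes whether the condition targets it.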

Create "host not reporting" condition

To define the Host not reporting condition criteria:

Follow standard procedures to create an infrastructure condition.
Select Host not reporting as the Alert type.
Define the Critical threshold for triggering the notification: minimum 5 minutes, maximum 60 minutes.
Enable the Don't trigger alerts for hosts that perform a clean shutdown option if you want to prevent false alerts when hosts are set to shut down via the command line.
Currently this feature is supported on all Windows systems and on Linux systems using systemd. Alternatively, you can add the hostStatus: shutdown tag to your host in addition to enabling the option above. As long as the tag is on the host, no Host not reporting violations will open for it, regardless of agent version or OS. Removing the tag allows the system to open Host not reporting violations for that host again.
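The rules in the steps above can be summarized in a short sketch. This is an illustrative model only, not the product's implementation: the function names and the `clean_shutdown_detected` flag are hypothetical, and the only facts taken from the text are the 5 to 60 minute range and the behavior of the hostStatus: shutdown tag when the option is enabled.

```python
from typing import Dict

MIN_THRESHOLD_MINUTES = 5
MAX_THRESHOLD_MINUTES = 60


def validate_threshold(minutes: int) -> int:
    """Accept only Critical thresholds within the documented range."""
    if not MIN_THRESHOLD_MINUTES <= minutes <= MAX_THRESHOLD_MINUTES:
        raise ValueError(
            f"threshold must be between {MIN_THRESHOLD_MINUTES} and "
            f"{MAX_THRESHOLD_MINUTES} minutes, got {minutes}"
        )
    return minutes


def is_suppressed(
    option_enabled: bool,
    host_tags: Dict[str, str],
    clean_shutdown_detected: bool,
) -> bool:
    """Return True when no violation should open for a silent host.

    Assumed model: suppression requires the clean-shutdown option, plus
    either a detected clean shutdown or the hostStatus: shutdown tag.
    """
    if not option_enabled:
        return False
    tagged = host_tags.get("hostStatus") == "shutdown"
    return tagged or clean_shutdown_detected


print(validate_threshold(7))                                   # 7
print(is_suppressed(True, {"hostStatus": "shutdown"}, False))  # True
print(is_suppressed(False, {"hostStatus": "shutdown"}, True))  # False
```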

Depending on its incident preferences, the policy defines which notification channels to use once the condition's Critical threshold has passed. To avoid "false positives," the host must stop reporting for the entire time period before a violation is opened.

Example: You create a condition to open a violation when any host in the filtered set stops reporting data for seven minutes.

If any host stops reporting for five minutes, then resumes reporting, the condition does not open a violation.
If any host stops reporting for seven minutes, even if the others are fine, the condition does open a violation.
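The seven-minute example can be traced with a small simulation. It assumes, for illustration only, that reporting is checked once per minute and that a violation opens when the run of consecutive silent minutes reaches the threshold; the real evaluation cadence is not described here, and `violation_opened` is a hypothetical name.

```python
from typing import Iterable


def violation_opened(reported: Iterable[bool], threshold_minutes: int) -> bool:
    """Return True if the host was silent for `threshold_minutes` in a row.

    Each item in `reported` is True when data arrived during that minute.
    """
    silent_minutes = 0
    for got_data in reported:
        # Any report resets the count, so short gaps never add up.
        silent_minutes = 0 if got_data else silent_minutes + 1
        if silent_minutes >= threshold_minutes:
            return True
    return False


# Silent for five minutes, then reporting again: no violation.
brief_gap = [True] * 3 + [False] * 5 + [True] * 3
print(violation_opened(brief_gap, threshold_minutes=7))  # False

# Silent for seven minutes: a violation opens.
long_gap = [True] * 3 + [False] * 7
print(violation_opened(long_gap, threshold_minutes=7))  # True
```

Each host is evaluated on its own timeline, which is why one silent host opens a violation even when the others are fine.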

Investigate the problem

To further investigate why a host is not reporting data:

Review the details in the email notification.
Use the link from the email notification to monitor ongoing changes in your environment from Infrastructure monitoring's Events page. For example, use the Events page to help determine if a host disconnected right after a root user made a configuration change to the host.
Optional: Use the email notification's Acknowledge link to confirm that you are aware of the alerting incident and are taking ownership of it.
Use the email links to examine additional details in the Incident details page.

Intentional outages

We can distinguish between unexpected situations and planned situations with the option Don't trigger alerts for hosts that perform a clean shutdown. Use this option for situations such as:

Host has been taken offline intentionally.
Host has planned downtime for maintenance.
Host has been shut down or decommissioned.
Hosts are autoscaled, or instances are shut down from a cloud console.

We rely on Linux and Windows shutdown signals to flag a clean shutdown.

We've confirmed that these scenarios are detected by the agent:

AWS Auto Scaling event with EC2 instances that use systemd (Amazon Linux, CentOS/Red Hat 7 and newer, Ubuntu 16 and newer, SUSE 12 and newer, Debian 9 and newer)
User-initiated shutdown of Windows systems
User-initiated shutdown of Linux systems that use systemd (Amazon Linux, CentOS/Red Hat 7 and newer, Ubuntu 16 and newer, SUSE 12 and newer, Debian 9 and newer)

We know that these scenarios are not detected by the agent:

User-initiated shutdown of Linux systems that don't use systemd (CentOS/Red Hat 6 and earlier, Ubuntu 14, Debian 8). This includes other, more modern Linux systems that still use Upstart or SysV init systems.
AWS Auto Scaling event with EC2 instances that don't use systemd (CentOS/Red Hat 6 and earlier, Ubuntu 14, Debian 8). This includes other, more modern Linux systems that still use Upstart or SysV init systems.
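Since detection depends on whether a Linux host uses systemd, it can help to check this before relying on the clean shutdown option. The sketch below tests for the /run/systemd/system directory, which is the same check that systemd's own sd_booted() performs. It is not something the infrastructure agent is documented to do, and the function name and its `run_dir` parameter are hypothetical.

```python
import os


def uses_systemd(run_dir: str = "/run/systemd/system") -> bool:
    """Return True when the host was booted with systemd as its init system.

    `run_dir` is a parameter only so the check can be exercised against
    another path; on a real host, keep the default.
    """
    return os.path.isdir(run_dir)


if uses_systemd():
    print("systemd detected: clean shutdowns should be flagged")
else:
    print("no systemd: consider the hostStatus: shutdown tag instead")
```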