Knowing which instances of your services and which machines in your fleet are no longer responding is a common requirement, whether to get someone to investigate or to drive automation. In this post I'll look at how you can do it with Prometheus.
I'll presume you've already set up monitoring and are scraping the instances, whether with the Node Exporter for machine monitoring or some other exporter. To generate an alert for each instance that has been down for 10 minutes:
    groups:
    - name: node.rules
      rules:
      - alert: InstanceDown
        expr: up{job="node"} == 0
        for: 10m
The power of labels means that you only need to define this alert once, and it automatically applies to all of your instances with a job label of node!
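The same idea stretches a little further. As a minimal sketch, dropping the job matcher lets one rule cover every scraped job, and the alert's labels can be used in an annotation to say which instance is affected. The group name instance.rules, the severity label and the wording of the summary are assumptions for illustration, not part of the setup described above.

```yaml
groups:
- name: instance.rules          # hypothetical group name
  rules:
  - alert: InstanceDown
    # No job matcher: one alert per down target, across all scraped jobs.
    expr: up == 0
    for: 10m
    labels:
      severity: page            # assumed label, e.g. for routing in Alertmanager
    annotations:
      # $labels holds the labels of the series that triggered the alert.
      summary: "Instance {{ $labels.instance }} of job {{ $labels.job }} has been down for 10 minutes"
```

Each firing alert carries the instance and job labels of the target that went down, so whoever or whatever receives it knows where to look.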
A single instance going down shouldn't be worth waking someone up over. How about only alerting when 25% of the instances are down?
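    groups:
    - name: node.rules
      rules:
      - alert: InstancesDown
        # avg(up) is the fraction of instances that are up;
        # below 0.75 means more than 25% of them are down.
        expr: avg(up{job="node"}) by (job) < 0.75

These simple examples show just a small glimpse of the power of Prometheus alerting.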
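As a further sketch along the same lines, the fraction of down instances can also be written out explicitly as a ratio of counts, which some people find easier to read than an average. This is an alternative formulation rather than the rule given above, and the alert name InstancesDownRatio is hypothetical.

```yaml
groups:
- name: node.rules
  rules:
  - alert: InstancesDownRatio   # hypothetical name for this variant
    # (instances that are down) / (all instances), per job.
    # Example: with 8 instances, 3 down gives 3/8 = 0.375, which fires;
    # 2 down gives 2/8 = 0.25, which does not, since 0.25 is not > 0.25.
    expr: |
      count by (job) (up{job="node"} == 0)
        /
      count by (job) (up{job="node"})
        > 0.25
```

When no instance is down the numerator returns no series, so the expression yields nothing and the alert stays quiet.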