In the previous post we looked at how to deal with all of the targets for a job disappearing. What if you wanted to alert on specific metrics from one target disappearing?
Alerting on numbers being too big or small is easy with Prometheus. But what if the numbers go missing?
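A minimal sketch of such a rule, assuming a hypothetical metric my_job_processed_total scraped under a job called myjob; absent() returns a single series with value 1 only when no series match the selector, so this fires if the metric vanishes entirely:

```yaml
groups:
  - name: absent-metrics
    rules:
      # Fires when no series match the selector at all, e.g. because the
      # exporter stopped exposing the metric. Names here are placeholders.
      - alert: ProcessedMetricMissing
        expr: absent(my_job_processed_total{job="myjob"})
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "my_job_processed_total is missing for job myjob"
```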
Since Prometheus 2.1 there is a feature to view alerting rule evaluation times in the rules UI. In this blogpost we'll see an example of how this can be used to identify an expensive rule expression.
Alerting is an art. One must be sure to alert just enough to be aware of all problems arising in the monitored system, while at the same time not drowning out the signal with excess noise. In this blogpost we'll explain some of the best practices to use when alerting with Prometheus.
If your applications are restarting regularly, whether due to segfaults or OOMs, it'd be nice to know.
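One possible way to express this, sketched under the assumption that the target exposes the standard process_start_time_seconds metric; the job label and the threshold of 3 restarts are illustrative:

```yaml
groups:
  - name: restarts
    rules:
      # process_start_time_seconds changes value each time the process
      # restarts, so counting changes over an hour approximates how often
      # it is crash-looping.
      - alert: FrequentRestarts
        expr: changes(process_start_time_seconds{job="myjob"}[1h]) > 3
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} restarted more than 3 times in the last hour"
```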
One of the major changes introduced in Prometheus 2.0 was that of staleness handling. Previously, for instant vectors Prometheus would return a point up to 5 minutes in the past, which caused a number of different issues.
In this blogpost we try and clear up some confusion by outlining the key differences between commonly confused alerting configuration options: group_interval, group_wait, and repeat_interval.
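To make the distinction concrete, here is a small illustrative Alertmanager route fragment; the receiver name, webhook URL, and the durations chosen are placeholders, not recommendations:

```yaml
route:
  receiver: team-pager
  group_by: ['alertname', 'cluster']
  # How long to wait before sending the first notification for a new group,
  # so that related alerts firing at nearly the same time are batched.
  group_wait: 30s
  # How long to wait before notifying about new alerts added to a group
  # that has already been notified about.
  group_interval: 5m
  # How long to wait before re-sending a notification for alerts that are
  # still firing.
  repeat_interval: 4h

receivers:
  - name: team-pager
    webhook_configs:
      - url: 'http://example.com/alert-hook'
```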
Usually alert thresholds are hardcoded in the alert. In more sophisticated setups it would be useful to parameterise the threshold based on another time series.
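As a sketch of one way to do that, the rule below compares the value against a separate threshold series using vector matching; my_queue_length and my_queue_length_threshold are hypothetical metric names:

```yaml
groups:
  - name: dynamic-thresholds
    rules:
      # Compare each series against a matching per-instance threshold series
      # rather than a hardcoded number. The threshold series could come from
      # a recording rule or an exporter exposing per-instance limits.
      - alert: QueueTooLong
        expr: my_queue_length > on(instance) group_left my_queue_length_threshold
        for: 15m
        annotations:
          summary: "Queue on {{ $labels.instance }} exceeds its configured threshold"
```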
At what point should you consider an alert resolved?
While the irate() function is useful for granular graphs, it is not suitable for alerting.
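For contrast, a sketch of an alerting rule using rate() over a wider window instead; my_app_errors_total and the threshold of 10 per second are placeholders:

```yaml
groups:
  - name: error-rates
    rules:
      # rate() over a window covering several scrape intervals smooths out
      # the brief spikes that irate(), which only looks at the last two
      # samples, would pass straight through to the alert.
      - alert: HighErrorRate
        expr: rate(my_app_errors_total{job="myjob"}[5m]) > 10
        for: 10m
        annotations:
          summary: "Error rate on {{ $labels.instance }} is above 10 per second"
```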