In the previous post we looked at how to deal with all of the targets for a job disappearing. What if you wanted to alert on specific metrics from one target disappearing?
Alerting on numbers being too big or small is easy with Prometheus. But what if the numbers go missing?
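A minimal sketch of such a rule, assuming a hypothetical metric my_job_processed_total scraped under a job called myjob; absent() returns a single series with value 1 only when no series match the selector, so this fires if the metric vanishes entirely:

```yaml
groups:
  - name: absent-metrics
    rules:
      # Fires when no series match the selector at all, e.g. because the
      # exporter stopped exposing the metric. Names here are placeholders.
      - alert: ProcessedMetricMissing
        expr: absent(my_job_processed_total{job="myjob"})
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "my_job_processed_total is missing for job myjob"
```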
Since Prometheus 2.1 there is a feature to view alerting rule evaluation times in the rules UI. In this blogpost we'll see an example of how this can be used to identify an expensive rule expression.
Alerting is an art. One must be sure to alert just enough to be aware of all problems arising in the monitored system, while at the same time not drowning out the signal with excess noise. In this blogpost we'll explain some of the best practices to use when alerting with Prometheus.
If your applications are restarting regularly, whether due to segfaults or OOMs, it'd be nice to know.
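One possible way to express this, sketched under the assumption that the target exposes the standard process_start_time_seconds metric; the job label and the threshold of 3 restarts are illustrative:

```yaml
groups:
  - name: restarts
    rules:
      # process_start_time_seconds changes value each time the process
      # restarts, so counting changes over an hour approximates how often
      # it is crash-looping.
      - alert: FrequentRestarts
        expr: changes(process_start_time_seconds{job="myjob"}[1h]) > 3
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} restarted more than 3 times in the last hour"
```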
One of the major changes introduced in Prometheus 2.0 was that of staleness handling. Previously, for instant vectors Prometheus would return a point up to 5 minutes in the past, which caused a number of different issues.
In this blogpost we try and clear up some confusion by outlining the key differences between commonly confused alerting configuration options: group_interval, group_wait, and repeat_interval.
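To make the distinction concrete, here is a small illustrative Alertmanager route fragment; the receiver name, webhook URL, and the durations chosen are placeholders, not recommendations:

```yaml
route:
  receiver: team-pager
  group_by: ['alertname', 'cluster']
  # How long to wait before sending the first notification for a new group,
  # so that related alerts firing at nearly the same time are batched.
  group_wait: 30s
  # How long to wait before notifying about new alerts added to a group
  # that has already been notified about.
  group_interval: 5m
  # How long to wait before re-sending a notification for alerts that are
  # still firing.
  repeat_interval: 4h

receivers:
  - name: team-pager
    webhook_configs:
      - url: 'http://example.com/alert-hook'
```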
Usually alert thresholds are hardcoded in the alert. In more sophisticated setups it would be useful to parameterise the threshold based on another time series.
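As a sketch of one way to do that, the rule below compares the value against a separate threshold series using vector matching; my_queue_length and my_queue_length_threshold are hypothetical metric names:

```yaml
groups:
  - name: dynamic-thresholds
    rules:
      # Compare each series against a matching per-instance threshold series
      # rather than a hardcoded number. The threshold series could come from
      # a recording rule or an exporter exposing per-instance limits.
      - alert: QueueTooLong
        expr: my_queue_length > on(instance) group_left my_queue_length_threshold
        for: 15m
        annotations:
          summary: "Queue on {{ $labels.instance }} exceeds its configured threshold"
```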
At what point should you consider an alert resolved?
While the irate() function is useful for granular graphs, it is not suitable for alerting.
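For contrast, a sketch of an alerting rule using rate() over a wider window instead; my_app_errors_total and the threshold of 10 per second are placeholders:

```yaml
groups:
  - name: error-rates
    rules:
      # rate() over a window covering several scrape intervals smooths out
      # the brief spikes that irate(), which only looks at the last two
      # samples, would pass straight through to the alert.
      - alert: HighErrorRate
        expr: rate(my_app_errors_total{job="myjob"}[5m]) > 10
        for: 10m
        annotations:
          summary: "Error rate on {{ $labels.instance }} is above 10 per second"
```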