Alerting on numbers being too big or small is easy with Prometheus. But what if the numbers go missing?
A blog on monitoring, scale and operational Sanity
Sometimes you want the raw samples inside Prometheus for analysis or debugging. How do you get that?
Since Prometheus 2.1, the rules UI has shown alerting rule evaluation times. In this blog post we'll see an example of how this can be used to identify an expensive rule expression.
When using the count aggregation operator you may have noticed that it sometimes returns nothing rather than 0. Why is this?
If your applications are restarting regularly, whether due to segfaults or OOMs, it'd be nice to know.
One of the major changes introduced in Prometheus 2.0 was staleness handling. Previously, for instant vectors, Prometheus would return a point up to 5 minutes in the past, which caused a number of different issues.
Have you ever wondered what percentage of time a given service or application spends up or down?
Prometheus 2.0 brought with it rule groups, making hierarchical aggregation easier than ever.
Have you ever wondered why the buckets in histograms are not just counters of events that fall into each bucket?
Usually alert thresholds are hardcoded in the alert. In more sophisticated setups, it would be useful to parameterise the threshold based on another time series.