Reliable Insights

A blog on monitoring, scale and operational Sanity

September 1, 2015

Alerting on Down Instances

Knowing which instances of your services and which machines in your fleet are no longer responding is a common requirement. Whether the goal is to get someone to investigate or to drive automation, in this post I'll look at how you can do it with Prometheus.

August 26, 2015

Conway’s Life in Prometheus

Some monitoring systems are very limited in what calculations you can do with them. Prometheus is not such a system, and today I'm happy to say that half a year after it publicly launched, Prometheus is Turing Complete.

August 23, 2015

There are 100,000 Seconds in a Day

Just after you've launched is not the best time to find out that you can't handle the load you predicted, or that running costs are much higher than you'd like. By estimating the operational parameters of your system as you design you can gain confidence that the system will work as you expect.

August 19, 2015

Viewing Logs for the JMX Exporter

Sometimes mBeans produce errors when scraped by the JMX exporter. Being able to look at detailed logs can help you figure out exactly which mBean is having issues and why.

August 19, 2015

Writing JSON Exporters in Python

A common question is whether there is a way to ingest JSON metrics from a random system into Prometheus. It's not possible to extract useful metrics from an arbitrary JSON blob, so that's not something that can be offered out of the box. However, it's easy to write an exporter in Python to produce meaningful metrics.
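To give a flavour of the idea, here is a minimal sketch that uses only the Python standard library rather than the Prometheus client library. It assumes a made-up JSON payload and made-up metric names (`myservice_requests_total`, `myservice_queue_length`); the point is that you pick out the fields you understand and give them meaningful names, types and labels yourself.

```python
import json

# Hypothetical JSON from an imaginary service. The field names are
# assumptions for illustration, not taken from any real system.
SAMPLE = '{"requests": {"handled": 1200, "failed": 3}, "queue_length": 7}'


def render_metrics(blob):
    """Map known fields of the JSON blob to Prometheus text format lines."""
    data = json.loads(blob)
    requests = data["requests"]
    lines = [
        "# HELP myservice_requests_total Requests processed, by result.",
        "# TYPE myservice_requests_total counter",
        f'myservice_requests_total{{result="handled"}} {requests["handled"]}',
        f'myservice_requests_total{{result="failed"}} {requests["failed"]}',
        "# HELP myservice_queue_length Items currently queued.",
        "# TYPE myservice_queue_length gauge",
        f'myservice_queue_length {data["queue_length"]}',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metrics(SAMPLE), end="")
```

A real exporter would fetch the JSON on each scrape and serve this output over HTTP; the mapping step stays the same.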

August 14, 2015

Scaling and Federating Prometheus

A single Prometheus server can easily handle millions of time series. That's enough for a thousand servers with a thousand time series each scraped every 10 seconds. As your systems scale beyond that, Prometheus can scale too.
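The arithmetic behind that claim is worth spelling out. This is just a back-of-the-envelope sketch using the numbers from the paragraph above; the variable names are mine.

```python
# Numbers from the text: a thousand servers, a thousand series each,
# scraped every 10 seconds.
servers = 1000
series_per_server = 1000
scrape_interval_seconds = 10

# Total time series held by the single Prometheus server.
total_series = servers * series_per_server

# Each series yields one sample per scrape, so this is the ingestion rate.
samples_per_second = total_series / scrape_interval_seconds

if __name__ == "__main__":
    print(f"{total_series:,} series, {samples_per_second:,.0f} samples/s")
```

So a million series at a 10 second interval works out to roughly a hundred thousand samples ingested per second.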

August 12, 2015

The Three Types of Cache

Caches are a common feature of distributed systems, often added to improve performance. There are three main types of cache, and knowing about them will help you design robust systems.

August 11, 2015

Adding Basic Auth to Prometheus with Nginx

Prometheus doesn't provide authentication support, in order to focus energy on making an awesome monitoring tool. Instead, users can take advantage of a more purpose-designed tool such as Nginx to do so. This post will look at how you can do that.

August 8, 2015

Quick Sensor Metrics with the Textfile Collector

Sometimes you need a machine metric that's not exported yet by the node exporter. The textfile collector can be used to quickly get such metrics graphed in Prometheus.

August 7, 2015

Reduce Noise From Disk Space Alerts

How often have you gotten alerted about disk space going over some threshold, only to discover it'll be weeks or even months until the disk actually fills? Noisy alerts are bad alerts. The new predict_linear() function in Prometheus gives you a way to have a smarter, more useful alert.
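To illustrate the idea behind predict_linear(), here is a rough Python sketch of a least-squares fit over recent samples, extrapolated forward. It is not Prometheus's implementation: the sample data is invented, and for simplicity it extrapolates from the last sample's timestamp rather than from the evaluation time.

```python
def predict_linear(samples, seconds_ahead):
    """Fit a line to (timestamp, value) pairs and extrapolate it forward.

    Needs at least two samples with distinct timestamps.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    # Simplification: extrapolate from the timestamp of the last sample.
    last_t = samples[-1][0]
    return intercept + slope * (last_t + seconds_ahead)


# Invented data: free bytes on a disk, shrinking by 10 bytes per second.
FREE_BYTES = [(0, 1000.0), (10, 900.0), (20, 800.0)]

# Alert only if the disk is predicted to be full within the horizon.
will_fill_soon = predict_linear(FREE_BYTES, 100) < 0
```

Alerting on the predicted value rather than on a fixed threshold means a slowly filling disk stays quiet, while a rapidly filling one alerts with time to spare.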
