Does it really have to be perfect?
A blog on monitoring, scale and operational Sanity
Does it really have to be perfect?
Metric names are part of a time series's identity, so shouldn't include information unrelated to identity.
Which of by/without and on/ignoring should you use?
Have you ever felt that a piece of software just isn't doing what you need?
Trying to improve alerting piecemeal can be difficult.
What happens when your clustered storage fails?
To avoid weirdness, write your files atomically.
Federation can be quite useful, but it's not replication.
Alert thresholds can be surprisingly tricky to get right.
Have you ever found yourself having to keep on updating and tweaking certain regexes in PromQL?