Failed requests are a fact of life, network weirdness and machine failures are inevitable. It can be tempting to simply retry the request when this happens, but this may cause more harm than good.
A blog on monitoring, scale and operational Sanity
Failed requests are a fact of life, network weirdness and machine failures are inevitable. It can be tempting to simply retry the request when this happens, but this may cause more harm than good.
When starting out it's easy to think that you need Docker, Kubernetes, Microservices, Continuous Deployment and all the other trending topics on Hacker News/Reddit/Lobsters. What do you really need?
Your service's traffic is steadily growing, latency has increased a bit but it's within reason. One day you launch a new customer and the latency jumps through the roof causing an outage. What happened? You hit the knee.
I enjoy cooking and regularly make scrumptious meals for myself. Does this mean that I'm capable of running a busy kitchen? Of course not! So why assume that all software engineers can automatically run production services?
Systems such as Consul perform healthchecking of local services and expose this information to other machines within the cluster. Does this mean that the service will work when you try to talk to it?
It's easy to get carried away by the power of labels with Prometheus. In the extreme this can overload your Prometheus server, such as if you create a time series for each of hundreds of thousands of users. Thankfully there's a way to deal with this without having to turn off monitoring or deploy a new version of your code.
Traffic from users to your servers isn't a steady stream, it waxes and wanes over the day and week. The peak-to-mean ratio is your primary tool to avoid outages or unnecessary costs due to this.
The most common way to learn about the expiry date of your website's SSL certificate is after it has expired. The blackbox exporter combined with Prometheus can let you know well in advance, letting you renew your certificate before users complain.
Caches are a common feature of distributed systems, often added to improve performance. There are three main types of cache, and knowing about them will help you design robust systems.