It's easy to get carried away by the power of labels with Prometheus. In the extreme this can overload your Prometheus server, such as if you create a time series for each of hundreds of thousands of users. Thankfully there's a way to deal with this without having to turn off monitoring or deploy a new version of your code.
Firstly you need to find which metric is the problem. Go to the expression browser on your Prometheus server (that's the /graph endpoint) and evaluate topk(20, count by (__name__, job)({__name__=~".+"})). This will return the 20 metrics with the most time series, broken down by job; which one is the problem should be obvious.
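Once you have a suspect, it can also help to see which label is responsible for the explosion. As a rough sketch (the metric name my_too_large_metric and the label user_id here are stand-ins for whatever you find in your own data), you can count the number of distinct values a label has:

  count(count by (user_id) (my_too_large_metric))

A label with thousands of distinct values is usually what is blowing up the series count.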
Now that you know the name of the metric and the job it's part of, you can modify the job's scrape config to drop it. Let's say the metric is called my_too_large_metric. Add a metric_relabel_configs section to drop it:
  scrape_configs:
   - job_name: 'my_job'
     static_configs:
      - targets:
        - my_target:1234
     metric_relabel_configs:
      - source_labels: [ __name__ ]
        regex: 'my_too_large_metric'
        action: drop
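If more than one metric needs to go, the regex can match several names at once. A small sketch, assuming a second offending metric called another_large_metric; relabel regexes are fully anchored, so the alternation matches whole metric names:

     metric_relabel_configs:
      - source_labels: [ __name__ ]
        regex: 'my_too_large_metric|another_large_metric'
        action: drop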
All the samples are still pulled from the target and parsed; metric_relabel_configs only drops them just before ingestion. So this should only be a temporary solution until you can push a fixed version of your code.
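Once the configuration has been reloaded you can check that the drop is working. One way, using the per-scrape series Prometheus records automatically, is to compare how many samples each target exposed with how many were left after metric relabelling:

  scrape_samples_scraped{job="my_job"} - scrape_samples_post_metric_relabeling{job="my_job"}

A large difference for the affected targets confirms the offending metric is being dropped.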