Using PromQL you can combine metrics for analysis.
It's not unknown for a new kernel version to introduce subtle problems, which can be difficult to spot on a machine-by-machine basis. Let's say you suspected that something was up with TCP sockets in the TIME_WAIT state. The number of such sockets is covered by the node_sockstat_TCP_tw metric.
So you can do:
avg without (instance)( node_sockstat_TCP_tw * on(instance) group_left(release) node_uname_info )
What this does is add the release label (the kernel version) to the node_sockstat_TCP_tw metric, based on its instance label matching that of node_uname_info. Since node_uname_info always has the value 1, the multiplication leaves the socket count unchanged. Finally we average the values, ignoring the instance label, which should produce a per kernel version result.
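Before trusting a per kernel version average, it can help to check how many machines sit behind each one. A minimal sketch, assuming each machine exposes exactly one node_uname_info series:

```promql
# Number of machines reporting each kernel release.
count by (release)( node_uname_info )
```

A release that only one or two machines are running will give a noisy average, so its result is best read with caution.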
This is only useful as long as all the kernel versions are seeing about the same load. If we wanted to normalise by the number of in-use TCP sockets, we could do:
avg without (instance)( node_sockstat_TCP_tw / node_sockstat_TCP_inuse * on(instance) group_left(release) node_uname_info )
While taking an average of a pile of ratios is dubious statistically, this could help spot at a very high level if there's a potential issue.
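One possible way to sidestep averaging ratios is to sum the numerator and the denominator per kernel version first and only then divide, which weights each machine by how many sockets it actually has. This is a sketch rather than a drop-in replacement, and it assumes that both node_sockstat metrics carry the same labels on every machine:

```promql
# Ratio of sums: total TIME_WAIT sockets over total in-use sockets, per kernel release.
  sum without (instance)( node_sockstat_TCP_tw * on(instance) group_left(release) node_uname_info )
/
  sum without (instance)( node_sockstat_TCP_inuse * on(instance) group_left(release) node_uname_info )
```

The trade-off is that a handful of very busy machines can dominate the result, so which form is more useful depends on whether you care about the typical machine or the fleet as a whole.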