[COS] Can't get Prometheus alert rules right

I’m trying to get alert rules to work with grafana-agent (machine version).

In my charm, I have defined a rule like this:

alert: MicrosampleCallsTotalValueExceeded
expr: rate(microsample_calls_total[2m]) > 0.1
for: 1m
labels:
  severity: critical
annotations:
  summary: "Rate limit for microsample_calls_total exceeded in {{ $labels.juju_model }}/{{ $labels.juju_unit }}"
  description: >
    The microsample_calls_total counter has increased by more than 0.1 within a two-minute period.
    The value of microsample_calls_total is currently: {{ $value }}
    Model: {{ $labels.juju_model }} 
    Unit: {{ $labels.juju_unit }}
    Instance: {{ $labels.instance }}
    LABELS = {{ $labels }}

My charm integrates with COS Lite just fine, but the alert rule never fires. This is what the alert rule looks like in Grafana after being added by the integration.

The expression does not return any values unless I remove “juju_charm” from the expression. Then it generates something that looks like the output below, which I assume is something the alert rule would be able to act on?

This raises some questions about how I should create alert rules:

  1. Why does the “juju_charm” label/variable make the expression break?
  2. How would I tune alerts to target a single unit if the grafana-agent lib manipulates my rule?
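
For the second question, what I have in mind is roughly the sketch below. The unit name "microsample/0" and the alert name are placeholders I made up for illustration. It assumes the scraped series actually carry a juju_unit label (the annotations above already reference it) and that the lib leaves a hand-written juju_unit matcher in place, which I have not verified.

```yaml
# Sketch only: "microsample/0" and the alert name are hypothetical placeholders.
# Assumes the series carry a juju_unit label and that the injected
# topology matchers do not conflict with this hand-written one.
alert: MicrosampleCallsTotalValueExceededSingleUnit
expr: rate(microsample_calls_total{juju_unit="microsample/0"}[2m]) > 0.1
for: 1m
labels:
  severity: critical
```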

There is also an ongoing discussion about the docs for how to do all of this for Loki as a next step: Using the Grafana Agent Machine Charm

@jose @0x12b @tmihoc @marcus @narindergupta


We are working on resolving the issue with juju_charm being added as we speak. Long story short, it should not get injected for the grafana agent machine charm, as it will never be present on the metrics.

In the meantime, you should be able to work around the bug by adding the following selector to your alert rule expression:

juju_charm=~".*"
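
To spell out why this should help, here is a rough sketch of the two forms of the expression. The label values are placeholders, and exactly which labels get injected is an assumption on my part, not the actual rendered rule. In PromQL, an exact matcher such as juju_charm="<charm>" only selects series that carry that label, so it returns nothing when the metrics never have it. A regex matcher that also matches the empty string, such as juju_charm=~".*", selects series that lack the label as well.

```yaml
# Sketch only: "<model>", "<app>" and "<charm>" are placeholders, and the
# set of injected labels is assumed, not copied from a real rendered rule.
# Form 1: an exact juju_charm matcher has been injected. No series carries
# that label, so the selector is empty and the alert can never fire.
expr: rate(microsample_calls_total{juju_model="<model>", juju_application="<app>", juju_charm="<charm>"}[2m]) > 0.1
---
# Form 2: the workaround. The regex also matches an empty (missing) label
# value, so series without juju_charm are still selected.
expr: rate(microsample_calls_total{juju_charm=~".*"}[2m]) > 0.1
```

This only works if the injection leaves an existing juju_charm matcher alone instead of adding its own next to it, which is what the workaround relies on.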


Any chance that the Juju machine instance ID could be referenced in the Juju topology (for LXD it's something like juju-df4rhu-0), or is the “instance” label what we have to work with?

It would be nice to be able to easily track down issues originating from specific LXC/VM IDs.

This did not work for me :frowning:

What happens? What does the final alert rule end up looking like if what you provision through the charm is the following?

alert: MicrosampleCallsTotalValueExceeded
expr: rate(microsample_calls_total{juju_charm=~".*"}[2m]) > 0.1
for: 1m
labels:
  severity: critical
annotations:
  summary: "Rate limit for microsample_calls_total exceeded in {{ $labels.juju_model }}/{{ $labels.juju_unit }}"
  description: >
    The microsample_calls_total counter has increased by more than 0.1 within a two-minute period.
    The value of microsample_calls_total is currently: {{ $value }}
    Model: {{ $labels.juju_model }} 
    Unit: {{ $labels.juju_unit }}
    Instance: {{ $labels.instance }}
    LABELS = {{ $labels }}

Yes it worked @0x12b!

No idea what went wrong the first time.
