Telemetry labels in the Grafana ecosystem

sed-i · 7 March 2023 03:43

Any application on any node may produce telemetry (e.g. metrics, logs). When telemetry from multiple sources is stored in a centralized database, we need to be able to differentiate telemetry by source (origin). This is accomplished with telemetry labels.

A telemetry label is a key-value pair. Telemetry labels can be specified:

in the telemetry items themselves
in ingestion jobs (“scrape configs”)

Telemetry labels are used throughout the Grafana ecosystem.

Metric labels

An app may expose labelled metrics under a /metrics endpoint. A simple way to see this in action is to find an instrumented app and curl its /metrics endpoint. One such app is prometheus:

$ sudo snap install prometheus

$ curl localhost:9090/metrics

# -- snip --

# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 14

# -- snip --

# HELP prometheus_http_requests_total Counter of HTTP requests.
# TYPE prometheus_http_requests_total counter
prometheus_http_requests_total{code="200",handler="/metrics"} 128
prometheus_http_requests_total{code="302",handler="/"} 1

# ...

In the example above,

process_open_fds is a metric without any labels
prometheus_http_requests_total is a metric with two labels, code and handler
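To make the name/label/value split concrete, here is a minimal Python sketch that pulls the labels out of sample lines like the ones above. The function name parse_sample is made up for this post, and the parser is deliberately simplified: it assumes no escaped quotes inside label values and no trailing timestamp, so it is not a full parser for the exposition format.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
# One key="value" pair inside the braces (escaped quotes are NOT handled).
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line):
    """Split a sample line into (metric name, labels dict, float value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


# The two kinds of lines from the curl output above.
for line in [
    "process_open_fds 14",
    'prometheus_http_requests_total{code="200",handler="/metrics"} 128',
]:
    print(parse_sample(line))
```

The unlabelled metric comes back with an empty labels dict, while the labelled one carries code and handler.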

Scrape job labels for metrics

While metric labels are set by the app developer, the monitoring service can append an additional fixed set of labels to all the metrics scraped by the same scrape job. Prometheus and grafana agent are two examples of monitoring services capable of scraping metrics.

For prometheus (or grafana agent) to scrape our apps (targets), we need to specify in its configuration file where to find them. This is also where we specify telemetry labels.

scrape_configs:
  - job_name: "some-app-scrape-job"
    metrics_path: "/metrics"
    static_configs:
      - targets: ["hostname.for.my.app:8080"]
        labels:
          location: "second_floor_third_server_from_the_left"
          purpose: "weather_station_cluster"

Labels that are specified under a static_configs entry are automatically “appended” to all metrics scraped from the targets:

$ curl -s --data-urlencode 'match[]={__name__="prometheus_http_requests_total"}' localhost:9090/api/v1/series | jq '.data'
[
  {
    "__name__": "prometheus_http_requests_total",
    "code": "200",
    "handler": "/metrics",
    "instance": "localhost:9090",
    "job": "prometheus",
    "location": "second_floor_third_server_from_the_left",
    "purpose": "weather_station_cluster"
  },
  {
    "__name__": "prometheus_http_requests_total",
    "code": "302",
    "handler": "/",
    "instance": "localhost:9090",
    "job": "prometheus",
    "location": "second_floor_third_server_from_the_left",
    "purpose": "weather_station_cluster"
  }
]
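The merge that produces the label sets above can be sketched in a few lines of Python. This is only an illustration, not prometheus code: attach_target_labels is a made-up name, and the conflict handling assumes the default honor_labels: false, where a scraped label that clashes with a server-side one is kept under an exported_ prefix.

```python
def attach_target_labels(sample_labels, target_labels, job, instance):
    """Return the label set stored for one scraped sample (simplified)."""
    merged = dict(sample_labels)
    # Labels added on the scraping side: job, instance and the static labels.
    server_side = {"job": job, "instance": instance, **target_labels}
    for name, value in server_side.items():
        if name in merged:
            # Assumed default (honor_labels: false): keep the app's own value
            # under an "exported_" prefix instead of dropping it.
            merged["exported_" + name] = merged[name]
        merged[name] = value
    return merged


static_labels = {
    "location": "second_floor_third_server_from_the_left",
    "purpose": "weather_station_cluster",
}
print(attach_target_labels(
    {"code": "200", "handler": "/metrics"},
    static_labels,
    job="prometheus",
    instance="localhost:9090",
))
```

Apart from __name__, the printed dict has the same keys as the first entry in the query output above.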

Similarly, “service labels” can be specified when using prometheus’s remote-write endpoint or push-gateway, and in grafana agent’s config file.

Log labels

Logs (“streams”) ingested by loki will be searchable by the specified labels. If you push logs directly to loki, you can attach labels to every “stream” pushed. In loki’s terminology, a stream is a set of log lines that share the same set of labels; a single push request can carry one or more streams:

{
  "streams": [
    {
      "stream": {
        "label": "value"
      },
      "values": [
          [ "<unix epoch in nanoseconds>", "<log line>" ],
          [ "<unix epoch in nanoseconds>", "<log line>" ]
      ]
    }
  ]
}
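As a rough sketch, a payload of this shape can be assembled in Python as follows. The function name build_push_payload and the label values are made up for illustration, and nothing is actually sent to loki here; the only assumption taken from the payload above is that timestamps are nanosecond epochs encoded as strings.

```python
import json
import time


def build_push_payload(labels, lines, now_ns=None):
    """Build a push payload holding one stream with the given labels."""
    if now_ns is None:
        now_ns = time.time_ns()
    # Each entry is [timestamp, line]; the timestamp is a string of nanoseconds.
    # Consecutive lines get increasing timestamps to keep them ordered.
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return {"streams": [{"stream": dict(labels), "values": values}]}


payload = build_push_payload(
    {"app": "demo", "level": "info"},  # hypothetical labels
    ["service started", "listening on :8080"],
)
print(json.dumps(payload, indent=2))
```

All log lines in the stream share the labels under "stream", which is what makes them searchable by those labels later.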

Scrape job labels for logs

Log files can be scraped by promtail or grafana agent, which then stream the log lines to loki using loki’s push api endpoint. Promtail, similar to grafana agent, has a scrape_configs section in its config file for specifying targets (log filenames) and associating labels with them. See also grafana agent’s config file docs.

Alert labels

By design, prometheus (and loki) evaluate all alert rules in a centralized fashion: if you want your alert rules to be evaluated, you must place them on the filesystem somewhere accessible by prometheus, and specify that path in prometheus’s config file:

rule_files:
  - /path/to/*.rules
  - /another/one/*.yaml

Alert definitions are not tied to any particular node, application or metric. This gives high flexibility in defining an alert. You could define an alert that triggers for any node that runs out of space, and another alert that triggers only for a specific application on a specific node. Narrowing down the scope of an alert is accomplished by using telemetry labels.

expr: process_cpu_seconds_total > 0.12 would trigger if the value of any time series with this metric name (regardless of its labels) exceeds 0.12.
expr: process_cpu_seconds_total{region="europe", app="nginx"} > 0.12 would trigger only for time series of this metric that are also labeled region="europe" and app="nginx".
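The narrowing effect of label matchers can be imitated with a small Python sketch. This is not how prometheus evaluates expressions; firing_series and the sample data are hypothetical, and only equality matchers are modelled.

```python
def firing_series(series, matchers, threshold):
    """Return the label sets that satisfy all matchers and exceed the threshold."""
    return [
        labels
        for labels, value in series
        # A missing label is treated like an empty string.
        if all(labels.get(key, "") == want for key, want in matchers.items())
        and value > threshold
    ]


# Hypothetical current values of process_cpu_seconds_total.
series = [
    ({"region": "europe", "app": "nginx"}, 0.50),
    ({"region": "asia", "app": "nginx"}, 0.70),
    ({"region": "europe", "app": "redis"}, 0.05),
]

# No matchers: every series above the threshold fires.
print(firing_series(series, {}, 0.12))
# With matchers: only the europe/nginx series can fire.
print(firing_series(series, {"region": "europe", "app": "nginx"}, 0.12))
```

Without matchers two series fire; with both matchers only one does, even though the asia series has the higher value.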

When an on-caller receives an alert (via alertmanager, karma or similar), they see a rendering of the alert, which includes the expr and label values, among a few additional fields.

Additional alert labels can be specified in the alert definition:

      labels:
        severity: critical

This is useful for:

Filtering and routing alerts (see grouping, inhibition, silences).
Enriching the message an on-caller sees with additional metadata.

Relabeling

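relabel_configs and metric_relabel_configs both rewrite label sets, but at different stages: relabel_configs is applied to a target’s labels before the scrape, while metric_relabel_configs is applied to the scraped samples (including the metric name, which is exposed as the __name__ label) before they are stored.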
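To give a feel for what a single relabeling rule does to a label set, here is a simplified Python imitation of the replace action. The function relabel_replace is made up for this post, it uses Python's regex engine rather than the one prometheus uses, and it covers only this one action, so treat it as a sketch of the idea rather than a reference.

```python
import re


def relabel_replace(labels, source_labels, target_label,
                    regex="(.*)", replacement="$1", separator=";"):
    """Imitate a 'replace' rule: derive target_label from the source labels."""
    # Join the source label values; missing labels count as empty strings.
    joined = separator.join(labels.get(name, "") for name in source_labels)
    # The regex has to match the whole joined value, otherwise nothing changes.
    match = re.fullmatch(regex, joined)
    if match is None:
        return dict(labels)
    # Translate $1 / ${1} references into Python's \g<1> template syntax.
    template = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", replacement)
    value = match.expand(template)
    result = dict(labels)
    if value:
        result[target_label] = value
    else:
        # An empty result removes the label.
        result.pop(target_label, None)
    return result


labels = {"job": "prometheus", "instance": "localhost:9090"}
# Hypothetical rule: copy the host part of "instance" into a new "host" label.
print(relabel_replace(labels, ["instance"], "host", regex=r"([^:]+):\d+"))
```

The same kind of rule could appear under either relabel_configs or metric_relabel_configs; what differs is the label set it is applied to.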