Prometheus-k8s docs - Troubleshooting integrations

sed-i · 12 June 2024 16:57

After you add scrape or remote-write relations to Prometheus, Juju may take some time to settle the model. Even after the model has settled, some data you expect (scrape jobs or alert rules) may be missing. Work through the checklist below to find where it gets lost.

Checklist

juju status --relations lists all the relations you expect and, if you have cross-model relations, the SAAS section shows a non-zero relation count for Prometheus.
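If you want to script this part of the check, here is a minimal bash sketch. It assumes the endpoints you expect show up as app:endpoint pairs (for example prom:metrics-endpoint) in the output of juju status --relations. missing_relations is a hypothetical helper, not a Juju command.

```bash
# Hypothetical helper (not part of Juju): report every expected "app:endpoint"
# that does not appear in the output of `juju status --relations`.
# Usage: missing_relations prom:metrics-endpoint prom:receive-remote-write
missing_relations () {
  local status ep missing=0
  # Capture the status once; fail if juju itself fails.
  status="$(juju status --relations)" || return 1
  for ep in "$@"; do
    # -F: fixed string; -w: do not match inside a longer name (e.g. myprom:...)
    if ! grep -qwF -- "$ep" <<< "$status"; then
      echo "missing relation endpoint: $ep"
      missing=1
    fi
  done
  return "$missing"
}
```

The function prints one line per endpoint it cannot find and returns non-zero if any are missing. It only checks that each endpoint is listed; it does not check the SAAS relation counts.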

Rules and scrape targets are listed in Prometheus

Query the IP address of a Prometheus unit (10.1.207.168 in these examples; juju status shows yours) to confirm that all the rules are configured:

curl 10.1.207.168:9090/api/v1/rules | jq

and that the scrape jobs are healthy:

curl 10.1.207.168:9090/api/v1/targets \
  | jq '.data.activeTargets | .[] | {scrapeUrl, health}'
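The two queries can be narrowed down to what usually matters when troubleshooting: which rule files were loaded, and which targets are failing. A sketch, assuming curl and jq are installed and that PROM_URL (a hypothetical variable) points at one of your Prometheus units; rule_files and unhealthy_targets are illustrative names.

```bash
# PROM_URL is an assumption; override it to point at one of your Prometheus units.
PROM_URL="${PROM_URL:-http://10.1.207.168:9090}"

# Print one line per loaded rule group: file, group name, number of rules.
rule_files () {
  curl -s "$PROM_URL/api/v1/rules" \
    | jq -r '.data.groups[] | "\(.file)\t\(.name)\t\(.rules | length)"'
}

# Print only the targets whose health is not "up", with the last scrape error.
unhealthy_targets () {
  curl -s "$PROM_URL/api/v1/targets" \
    | jq -r '.data.activeTargets[] | select(.health != "up")
             | "\(.scrapeUrl)\t\(.health)\t\(.lastError)"'
}
```

Comparing the file column from rule_files with the directory listing in the next section helps you see whether a rule file exists on disk but was not loaded.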

If something is missing, proceed to the next section.

Rules are present on the Prometheus filesystem

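Rule filenames include topology information, so you can tell which related application each file came from. Confirm that everything you expect is there (the examples assume the Prometheus application is named prom):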

juju ssh --container prometheus prom/0 ls /etc/prometheus/rules
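To check a list of applications in one go, a small bash sketch can compare the directory listing against the application names you expect. It assumes each related application's name appears somewhere in its rule filename as part of the topology information; missing_rule_files is a hypothetical helper.

```bash
# Hypothetical helper: list the rule files on a Prometheus unit and report any
# expected application whose name does not appear in any filename.
# Assumption: the application name is part of the topology-based filename.
# Usage: missing_rule_files prom/0 alertmanager loki
missing_rule_files () {
  local unit="$1"; shift
  local files app missing=0
  files="$(juju ssh --container prometheus "$unit" ls /etc/prometheus/rules)" || return 1
  for app in "$@"; do
    # Fixed-string substring match against the whole listing.
    if ! grep -qF -- "$app" <<< "$files"; then
      echo "no rule file mentions: $app"
      missing=1
    fi
  done
  return "$missing"
}
```

A substring match is deliberately loose: it tells you where to look, not that the file contents are correct.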

If something is missing, proceed to the next section.

Rules and scrape jobs are listed in relation data

To inspect the relation data coming into Prometheus, use juju show-unit:

juju show-unit --format json prom/0 \
  | jq '."prom/0"."relation-info"'

To show only the cross-model relations:

juju show-unit --format json prom/0 \
  | jq '."prom/0"."relation-info" | .[] | select(."cross-model" == true)'

To narrow this down further to the receive-remote-write relation:

juju show-unit --format json prom/0 \
  | jq '."prom/0"."relation-info" | .[] | select(."cross-model" == true) | select(.endpoint == "receive-remote-write")'

To inspect all the scrape jobs coming in over the metrics-endpoint relation:

juju show-unit --format json prom/0 \
  | jq -r '."prom/0"."relation-info" | .[] | select(.endpoint == "metrics-endpoint")."application-data"."scrape_jobs"'

Similarly for alert rules:

juju show-unit --format json prom/0 \
  | jq -r '."prom/0"."relation-info" | .[] | select(.endpoint == "metrics-endpoint")."application-data"."alert_rules"'
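In relation data the alert_rules and scrape_jobs values are JSON-encoded strings, so jq's fromjson can decode them in the same pipeline. A sketch, assuming alert_rules uses the Prometheus rule-file layout ({"groups": [...]}) and scrape_jobs is a list of scrape configs with static_configs; alert_names and scrape_targets are hypothetical names.

```bash
# Print the name of every alerting (or recording) rule received over metrics-endpoint.
alert_names () {
  juju show-unit --format json prom/0 \
    | jq -r '."prom/0"."relation-info" | .[]
             | select(.endpoint == "metrics-endpoint")
             | ."application-data".alert_rules // empty
             | fromjson
             | .groups[].rules[]
             | .alert // .record'
}

# Print every static target listed in the scrape jobs received over metrics-endpoint.
scrape_targets () {
  juju show-unit --format json prom/0 \
    | jq -r '."prom/0"."relation-info" | .[]
             | select(.endpoint == "metrics-endpoint")
             | ."application-data".scrape_jobs // empty
             | fromjson
             | .[] | .static_configs[]? | .targets[]?'
}
```

Targets are printed exactly as they appear in the relation data, which may differ from the final scrape URLs Prometheus uses.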

For convenience, you can wrap these queries in a shell function:

app_data () {
  # Usage examples:
  #   app_data prom/0 receive-remote-write alert_rules
  #   app_data prom/0 metrics-endpoint scrape_jobs
  #   app_data prom/0 metrics-endpoint    # all application data

  if [[ $# -lt 2 || $# -gt 3 ]]; then
    echo "Usage: app_data UNIT ENDPOINT [KEY]" >&2
    return 1
  fi

  local unit="$1"      # unit, e.g. prom/0
  local endpoint="$2"  # endpoint, e.g. receive-remote-write
  local key="${3:-}"   # app relation data key, e.g. scrape_jobs (optional)

  # Pass values with --arg instead of splicing them into the jq program.
  juju show-unit --format json "$unit" \
    | jq -r --arg unit "$unit" --arg endpoint "$endpoint" --arg key "$key" \
        '.[$unit] | ."relation-info" | .[] | select(.endpoint == $endpoint)
         | ."application-data" | if $key == "" then . else .[$key] end'
}

`many-to-many matching not allowed: matching labels must be unique on one side`

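This PromQL error appears when a binary operation uses vector matching (on(...) together with group_left or group_right) and the "one" side of the match has more than one series with the same values for the matching labels. Prometheus cannot tell which of those series to join, so the query fails. A common cause is the same metrics being scraped through more than one relation.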

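For example, in the following query ceph_pool_metadata is the "one" side, so the query fails if two ceph_pool_metadata series share a pool_id: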

(ceph_pool_bytes_used{}) *on (pool_id) group_left(name)(ceph_pool_metadata{})
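To find which match groups are duplicated, you can count the series on the "one" side per matching label. A sketch using the Prometheus instant-query endpoint (/api/v1/query) through curl; PROM_URL, dup_query and find_duplicates are hypothetical names.

```bash
# PROM_URL is an assumption; override it to point at one of your Prometheus units.
PROM_URL="${PROM_URL:-http://10.1.207.168:9090}"

# Build a PromQL query returning every label value that has more than one series.
dup_query () {
  local metric="$1" label="$2"
  printf 'count by (%s) (%s) > 1' "$label" "$metric"
}

# Run it as an instant query; an empty "result" list means no duplicates.
find_duplicates () {
  curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$(dup_query "$1" "$2")"
}
```

For example, find_duplicates ceph_pool_metadata pool_id returns every pool_id with more than one ceph_pool_metadata series. Querying ceph_pool_metadata for one of those pool_id values then shows which labels (such as job) differ between the copies, which hints at where the duplicate comes from.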

Checklist

No redundant telemetry relations in place. Make sure your application has only one of the following relations:
- grafana-agent:juju-info
- grafana-agent:cos-agent
- cos-proxy:prometheus-target
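To check one application against this list, a bash sketch can scan the relations table. It assumes grafana-agent and cos-proxy are deployed under those application names and that each relation row lists both app:endpoint pairs; redundant_telemetry is a hypothetical helper.

```bash
# Hypothetical helper: print the telemetry relations of one application, as
# found in `juju status --relations`, and fail if there is more than one.
# Usage: redundant_telemetry ceph-mon
redundant_telemetry () {
  local app="$1" found count
  found="$(juju status --relations | awk -v app="$app" '
    {
      mine = 0; telemetry = ""
      for (i = 1; i <= NF; i++) {
        # A field starting with "<app>:" marks a relation of this application.
        if (index($i, app ":") == 1) mine = 1
        if ($i == "grafana-agent:juju-info" || $i == "grafana-agent:cos-agent" ||
            $i == "cos-proxy:prometheus-target") telemetry = $i
      }
      if (mine && telemetry != "") print telemetry
    }')"
  # Count non-empty lines (0 when nothing was found).
  count="$(grep -c . <<< "$found")"
  [[ -n "$found" ]] && echo "$found"
  if (( count > 1 )); then
    echo "redundant telemetry relations for $app: $count" >&2
    return 1
  fi
}
```

For example, redundant_telemetry ceph-mon prints the telemetry relations it finds for ceph-mon and returns non-zero if there is more than one.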