Solving high cardinality labels in Loki

Introduction

Contrary to what happens with indexes in SQL databases like MySQL or PostgreSQL, high cardinality is something to avoid when talking about labels in Loki.

In SQL databases, we look for indexes to have high cardinality because this improves query efficiency by allowing the index to quickly filter the results. So, while in Loki this can be a problem, in SQL it can be an advantage for optimising data access.

Having high cardinality in labels is problematic because it means a label has too many unique values, which significantly increases the number of streams that need to be stored and managed. For example, a job label with 5 values combined with a filename label with 1,000 possible values can produce up to 5,000 streams instead of 5. This leads to higher memory usage, degrades query performance, overloads storage, and makes the system hard to scale efficiently. In short, high cardinality can overwhelm system resources and slow down its overall operation.

How Loki uses labels

According to the Loki documentation:

Labels in Loki perform a very important task: They define a stream. More specifically, the combination of every label key and value defines the stream. If just one label value changes, this creates a new stream.

If you are familiar with Prometheus, the term used there is series; however, Prometheus has an additional dimension: metric name. Loki simplifies this in that there are no metric names, just labels, and we decided to use streams instead of series.
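
A minimal illustration with hypothetical label values: the two selectors below share exactly the same label keys, but because the filename value differs, Loki stores them as two separate streams:

    {job="varlog_scraper", filename="/var/log/syslog"}    # stream 1
    {job="varlog_scraper", filename="/var/log/auth.log"}  # stream 2: only the filename changed

Every additional distinct value of filename creates yet another stream, each with its own chunks and index entries.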

When can we face a high cardinality situation?

@FaQ was very clear when he explained his situation:

"When using Loki to ingest logs from an Openstack cloud, the inclusion of the filename label for each log line can potentially lead to a huge cardinality.

Openstack uses libvirt to spawn VMs, and libvirt creates one log file for each VM launched (/var/log/libvirt/qemu/$DOMAIN_NAME.log). Some clouds used as build farms can spawn thousands of VMs in the course of a few days, each with its own value for the filename label."

We should also avoid labels that include values such as process_id, thread_id, or Kubernetes pod names.

Is it possible to address this situation in the Loki + Grafana Agent charms?

The short answer is: Yes, we need structured_metadata

The long answer is still Yes, but the devil is in the details :wink:

Loki

Since Loki 2.9.4, the structured_metadata feature has been GA.

Structured metadata is a way to attach metadata to logs without indexing them or including them in the log line content itself. Examples of useful metadata are kubernetes pod names, process ID’s, or any other label that is often used in queries but has high cardinality and is expensive to extract at query time.

So it feels like exactly what we need in the OpenStack situation described by @FaQ!

In order to enable that feature in Loki we need to do two things:

  1. Enable allow_structured_metadata in the config file:
    limits_config:
      allow_structured_metadata: true
    
  2. Make sure the schema_config we are using is at least v13 (if you are migrating an existing deployment, see the sketch right after this list):
    schema_config:
      configs:
      - from: '2024-09-03'
        index:
          period: 24h
          prefix: index_
        object_store: filesystem
        schema: v13
        store: tsdb
    
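
If a deployment already has data stored under an older schema (say v12), the documented approach is not to modify the existing entry but to add a new one with a from date in the future, so older chunks remain readable with the schema they were written with. A rough sketch of how that might look (the dates and the v12 entry are hypothetical):

    schema_config:
      configs:
      - from: '2023-01-01'    # existing entry, left untouched
        index:
          period: 24h
          prefix: index_
        object_store: filesystem
        schema: v12
        store: tsdb
      - from: '2024-09-03'    # new entry, takes effect from this (future) date
        index:
          period: 24h
          prefix: index_
        object_store: filesystem
        schema: v13
        store: tsdb

Once the v13 period is active, allow_structured_metadata can be enabled.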

We have a WIP PR for this. :wink:

Grafana Agent

Once structured_metadata is enabled in Loki, we need to scrape and send logs with the high-cardinality label attached as structured_metadata instead of as a regular label.

In order to do that, we need to modify the grafana-agent scrape_config definition with something like this:

    scrape_configs:
    - job_name: varlog_scraper
      pipeline_stages:
      - drop:
          expression: .*file is a directory.*
      - structured_metadata:
          filename: filename
      - labeldrop:
        - filename
      static_configs:
      - labels:
          __path__: /var/log/**/*log
          instance: juju-ccf9c7-7.lxd
          job: varlog_scraper
          juju_application: agent
          juju_charm: grafana-agent
          juju_model: apps
          juju_model_uuid: c157bd5e-8306-40a9-8307-b00c63ccf9c7
          juju_unit: agent/13
        targets:
        - localhost

The important part of this piece of config is pipeline_stages. Let’s go one by one:

  1. drop stage:

    • This stage filters out any log entries that match the regular expression .*file is a directory.*. If the log message contains this pattern, it is discarded, preventing unnecessary logs from being sent to Loki. (Note that this stage is already present in the grafana-agent charm.)
  2. structured_metadata stage:

    • In this stage, the filename label is added to the structured metadata, associating the log entry with the corresponding log file name. This allows the filename to be stored in the log’s metadata without indexing it as a label, helping reduce cardinality.
  3. labeldrop stage:

    • The labeldrop step removes the filename label after it has been processed and added to structured metadata. This ensures that the label doesn’t contribute to high cardinality by being indexed.

Each step processes log entries sequentially, performing specific actions to control what information is kept, how it’s stored, and which parts are discarded.
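
The same pattern is not limited to labels that already exist: values extracted from the log line itself can also be attached as structured metadata. Below is a hypothetical sketch, assuming the logs carry a request_id=<value> token (the field name and the regular expression are illustrative, not part of the charm's configuration):

    pipeline_stages:
    - regex:
        # extract a hypothetical request_id=<value> token from the log line
        # into the extracted map, using a named capture group
        expression: '.*request_id=(?P<request_id>\S+).*'
    - structured_metadata:
        # attach the extracted value as structured metadata
        # instead of an indexed label
        request_id:

This keeps a per-request identifier queryable without turning every request into its own stream.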

How do we see this in Grafana?

As we saw in the grafana-agent configuration, we scrape all the files that match the glob pattern /var/log/**/*log, for instance /var/log/pepe.log.

So, in order to search for logs with the job="varlog_scraper" label and the filename metadata /var/log/pepe.log, we need to execute this LogQL query:

{job="varlog_scraper"} | filename="/var/log/pepe.log"

to get the matching log lines in Grafana.

Some important comments should be made about this:

Although the filename key looks like it is part of the regular (indexed) labels, it is not: we are not including it inside the { } stream selector, we filter on it after the | instead. At this point the Grafana UI is not clear enough to let us know at a glance which labels are indexed and which ones are part of structured_metadata.

We can verify this by trying to use filename as a regular label by executing:

{job="varlog_scraper", filename="/var/log/pepe.log"}

The result of that query is an empty set, because no stream has filename as an indexed label.

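One nice consequence of keeping filename as structured metadata is that the full label filter syntax still works after the |, including regular expression matches. For example, a hypothetical query that would pull the logs of every libvirt VM at once, assuming those files are scraped by the same job:

{job="varlog_scraper"} | filename=~"/var/log/libvirt/qemu/.+"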

Final thoughts

The decision to move the filename label into structured_metadata rather than indexing it is crucial for reducing cardinality in Loki. As we said before, indexing labels like filename would lead to an overwhelming number of streams. This not only affects Loki's storage efficiency but also impacts query performance, making the system harder to scale. By storing filename as metadata, we retain the ability to filter logs without unnecessarily inflating the cardinality.

Providing Juju administrators flexibility in defining scrape_configs is important, but it’s also key to follow best practices to avoid performance issues. Indexing highly variable labels, like filenames, can increase cardinality and strain resources. Using structured metadata for such labels allows for effective log filtering while maintaining a scalable, efficient Loki deployment. This ensures a responsive system that balances flexibility with performance.