How to tune Loki data log volumes/ingestion/retention in COS

We are in the process of deploying COS into production and are now looking for some information about tuning the data ingestion/retention with Loki in COS.

We have about 30TB in a microceph, and we need to tune data ingestion so that metrics and logs fit within it. We also have a significant number of ingesting units.

Our question is:

What can we do to affect the amount of logs that we keep or collect, since we don’t have infinite storage?

For example:

  • Can we set a max capacity level for logs of, let’s say, X TB?
  • Can we filter at the logging source, so that some logs are effectively not shipped at all?
  • Can we compress data?
  • Can we filter our logs in Loki, e.g. by removing data?
  • etc.?

Note: This is likely something that anyone in the process of deploying COS into production would have to deal with. So the question is phrased in a way that the answer could go into some upstream documentation for the COS stack itself.

Thanks for surfacing this @erik-lonroth.

  • The loki charm has config options for ingestion limit and retention period.
  • Size-based retention is something many people are interested in, but it was difficult to implement (1, 2).
  • Check out the sizing guide.
  • Promtail has a drop directive, and we can try to think how to expose it in the charm. Suggestions welcome!
    • Haven’t seen yet if dropping is possible on the receiving side (loki)
  • See also info about the REST endpoint for manual deletion
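Since size-based retention is not available, one workaround is to derive a time-based retention period from a storage budget. Below is a rough sketch in Python. The function name, the headroom figure and the example numbers are all made up for illustration, and the daily on-disk volume (after Loki's chunk compression) is something you would have to measure on your own deployment:

```python
def retention_days(budget_gb: int, stored_gb_per_day: int, headroom_percent: int = 20) -> int:
    """Longest whole-day retention period that fits a storage budget.

    budget_gb:         storage set aside for logs (hypothetical input)
    stored_gb_per_day: measured on-disk growth per day, after compression
    headroom_percent:  share of the budget kept free as a safety margin
    """
    if stored_gb_per_day <= 0:
        raise ValueError("stored_gb_per_day must be positive")
    if not 0 <= headroom_percent < 100:
        raise ValueError("headroom_percent must be in the range 0..99")
    # Integer arithmetic keeps the result free of floating-point surprises.
    usable_gb = budget_gb * (100 - headroom_percent) // 100
    return usable_gb // stored_gb_per_day


if __name__ == "__main__":
    # Illustrative numbers only: 10 TB for logs, 200 GB/day stored on disk.
    print(retention_days(10_000, 200), "days")
```

The result could then be used as a starting value for the charm's retention period option, and revisited as the measured daily volume changes.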

Thanks back for a fantastic COS stack. We hope to get this running in production in the near future. Our journey here has been very long, since we have had to battle through multiple learning and design phases. @0x12b has been following us on this, and I hope to some day be able to let you in on the various challenges, but also the benefits, of getting here.

We will study up on your suggestions, but I also think this needs to go into an operational manual or deployment guide for others who might follow in these footsteps.

Agreed. Currently listed here: Charmhub | Deploy Loki using Charmhub - The Open Operator Collection.


@sed-i I was looking at the image and wondered whether the picture you show actually reflects what is going on with ingesting logs and metrics.

I’m attaching my current understanding of this, although I think “grafana-agent” is also involved here.

Is my view on this wrong? Perhaps you could help clarify this a bit more, since there are more components involved, right?

There are two things at play here:

  • The level of isolation depends on what you’re measuring. If you only want one workload’s logs, then it should be the only one related to loki. An alternative option is to relate the same workload to two different loki instances: one that is part of COS and within your complete context, and another loki that is dedicated to the measurement.
  • The original diagram is a relation diagram: the lines mean a juju relation. In your diagram, it looks like the lines are HTTP connections?

In any case, it would be interesting to compare results from the following two loki instances:


In this diagram, boxes are apps and lines are juju relations. Traefik is there, as part of a COS bundle, but is beside the point.