Some questions on the opentelemetry-collector for machine charms

I’m testing out the opentelemetry-collector for machine charms.

Current cycle: “channel=2/candidate, rev=77”

I have implemented a rudimentary Loki charm using the lib/charms/loki_k8s/v1 library. The charm is not yet released, but it works with a patch to the lib.

The relevant passage is here:

        self.lokiprovider = LokiPushApiProvider(
            self,
            port=external_url.port,
            scheme=external_url.scheme,
            address=external_url.hostname,
            path=f"{external_url.path}/loki/api/v1/push",
        )

So all good, but I have a few questions and observations for the maintainers of the lib and opentelemetry-collector:

Our cloud is LXD, which may matter here.

  1. We are getting the errors below, which seem related to some configuration in either Loki or the opentelemetry-collector. We have tried various settings but can't seem to get rid of them. Any clues here? We are not ingesting a lot.

Nov 07 12:14:45 juju-e0574e-0 loki[22350]: level=error ts=2025-11-07T12:14:45.83051444Z caller=manager.go:49 component=distributor path=write msg="write operation failed" details="ingestion rate limit exceeded for user fake (limit: 4194304 bytes/sec) while attempting to ingest '100' lines totaling '24120' bytes, reduce log volume or contact your Loki administrator to see if the limit can be increased" org_id=fake

  2. The node-exporter seems unhappy. This seems related to the snap (Bug report here), which might be an LXD issue. Can we turn off the node-exporter somehow in the charm/snap? (We are running a separate node-exporter, so this one is not needed for us.)

2025-11-13T17:44:30.799754+00:00 juju-67bc6e-0 node-exporter.node-exporter[1926175]: time=2025-11-13T17:44:30.799Z level=ERROR source=collector.go:168 msg="collector failed" name=logind duration_seconds=0.150347986 err="unable to get seats: An AppArmor policy prevents this sender from sending this message to this recipient; type=\"method_call\", sender=\":1.41393\" (uid=0 pid=1926175 comm=\"/snap/node-exporter/1904/bin/node_exporter --colle\" label=\"snap.node-exporter.node-exporter (enforce)\") interface=\"org.freedesktop.login1.Manager\" member=\"ListSeats\" error name=\"(unset)\" requested_reply=\"0\" destination=\"org.freedesktop.login1\" (uid=0 pid=293 comm=\"/usr/lib/systemd/systemd-logind\" label=\"unconfined\")"

  3. Where does the “juju_unit” topology label from the opentelemetry-collector come from? It's not in Grafana (we get almost everything else, but juju_unit isn't there).

  4. We don't want to ship all of /var/log, as this would start filling up Loki. So how can we filter things out on the local unit, based on, for example, our service (snap.polkadot.service)? The service doesn't have a logfile other than /var/log/syslog.

  5. We usually want to monitor the principal application to which opentelemetry-collector is related, but the principal charm's information is nowhere to be found. How can we get telemetry from the principal charm? At the moment, everything is labelled as the opentelemetry-collector charm.

Hey @erik-lonroth

(1) For Loki’s “ingestion rate limit exceeded”, you can juju config a higher ingestion limit (see the sketch after these answers).

(2) Re login1.Manager: the login-observe interface does not auto-connect; we need to add it to the charm.

(3) juju_unit is missing by design, but the logs from /var/log get the instance label. We should probably add the instance label to logs coming from the log slots.

(4) You can use the path_exclude config option.

(5) We are working on adding an info metric to address this.
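
As an illustration of (1): the error maps to Loki's per-tenant limits_config. Below is a minimal sketch of the keys involved, assuming the Loki charm renders its config file from Python; the limits_config key names are Loki's own, while the values and the surrounding snippet are hypothetical examples of a raised limit.

    # Sketch: the Loki limits_config keys behind the "ingestion rate limit exceeded"
    # error. The key names are Loki's own; the values below are hypothetical
    # examples of a raised limit.
    import yaml

    limits = {
        "limits_config": {
            "ingestion_rate_mb": 16,        # default is 4 (== the 4194304 bytes/sec in the error)
            "ingestion_burst_size_mb": 32,  # allow short bursts above the sustained rate
        }
    }

    # A charm config option could feed these values and merge the result into Loki's config file.
    print(yaml.safe_dump(limits, sort_keys=False))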

I have tried increasing this value to over 64 MB on the Loki side without getting rid of the error. I suspect something else is going on, but I can't figure out what. Any Loki experts around?

Right, ok. We will have to wait for this then.

I think this is wrong. Not providing the full topology breaks the contract promised by the whole idea of the Juju topology. I expect to be able to use the Juju topology as a means to easily correlate problems that occur in our environment.

If we can't get this, we would need to patch it in ourselves somehow, which I guess would mean forking the whole opentelemetry-collector, which must be wrong.

See my comment on this here: `instance` label missing from logs collected from the log slots · Issue #123 · canonical/opentelemetry-collector-operator · GitHub

How would I use path_exclude to filter out anything coming from a specific systemd unit in /var/log/syslog? The docs for this are here and don't say anything about a specific systemd unit. I don't think that is possible?

Would our only option be to patch the systemd unit file to send logs to a specific file? It would be much better to provide a means to target a specific service on the host. (New feature?)

I see. I couldn't fully understand what this would imply. I actually tried to do this myself at some point, providing a different relation to send data over to the subordinate, but I never got it all the way. But I really think this feature is needed, since it makes little sense to “monitor” the opentelemetry-collector itself when the purpose generally is to monitor the principal workload.

I am not trying to monitor the opentelemetry-collector after all. I’m trying to monitor my principal workload and get labels for this in Loki.

otelcol (and gagent) are subordinate charms, which means that if multiple principals are deployed `--to` the same machine, and if they are related to the same otelcol app, then multiple otelcol units will be deployed to that same machine as well.

We do not want to duplicate all the logs just to be able to slap e.g. juju_unit="otelcol/0" to one copy and juju_unit="otelcol/1" to another copy. Instead, we have only one copy of the logs, without juju_unit, but with instance.

Also, why would you want e.g. postgres logs labelled with otelcol unit? In any case, the instance label is somewhat equivalent to juju_unit.

You can't do that with path_exclude; path_exclude is for excluding particular files scraped from /var/log/*.

To filter per systemd unit, you could try providing your own processor config.

https://documentation.ubuntu.com/observability/latest/how-to/selectively-drop-telemetry/#drop-logs
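
A minimal sketch of what such a processor config could look like, assuming the syslog lines arrive through a filelog receiver so the service can only be matched in the log body. The logs/log_record structure and IsMatch() come from the upstream filter processor and OTTL; the processor name and the regex are placeholders for your own service.

    # Sketch: an OpenTelemetry Collector "filter" processor that drops log records
    # whose body mentions a given service, built as a dict and rendered to YAML.
    # The processor name and the regex are placeholders.
    import yaml

    custom_processor = {
        "processors": {
            "filter/drop-polkadot": {
                "logs": {
                    "log_record": [
                        # A record matching any listed OTTL condition is dropped.
                        'IsMatch(body, ".*polkadot.*")',
                    ]
                }
            }
        }
    }

    print(yaml.safe_dump(custom_processor, sort_keys=False))

Inverting the condition (not IsMatch(body, ".*polkadot.*")) should instead keep only that service's lines and drop everything else, which is closer to the path_include idea discussed below.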

I would find it very useful to be able to have a path_include config in this case as it would be a super clean way to track a single application, which is usually what I want.

It's kind of aggressive, in my mind, to assume by default that EVERYTHING should be sent from a host/unit, rather than being somewhat conservative. After all, filling up Loki with logs is expensive…

Why not start small with /var/log/syslog and expand from that?

I started a feature request: Add a path_include option · Issue #124 · canonical/opentelemetry-collector-operator · GitHub
