Cross Model Integration: COS Lite with LXD plus MicroK8s makes Juju error

I set up a COS Lite model today, as described here: Metallb won't assign IP-addresses to services after reboot - charm - Charmhub

Once that model was OK, I added the MicroK8s cloud to a second LXD controller (name: localhost-localhost) with juju add-cloud.

As you can see below, the juju controller happily accepted this new cloud.

Screenshot from 2023-05-22 23-59-26

From this, my intention is to write a machine charm that can be related to the COS stack. So I put together this charm: https://github.com/erik78se/juju-operators-examples/tree/main/observed (which at this point only supports the prometheus interface).

The guide I followed was more or less this: Charmhub | Using the Grafana Agent Machine Charm

Side note: many things aren't working in it. I'm happy to help fix it @tmihoc

I deployed the packed (observed) charm into a new model on the LXD cloud (localhost-localhost), along with grafana-agent from the edge channel (this is all from the guide):

juju deploy ./observed.charm
juju deploy grafana-agent --channel edge
juju relate observed:cos-agent grafana-agent

I then offer the COS Lite prometheus, then consume it and relate it.
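Roughly, the commands look like this (the offer and model names are from memory, so treat them as illustrative):

# in the COS Lite model (coslight2): offer the prometheus remote-write endpoint
juju offer prometheus:receive-remote-write
# -> the offer becomes available as admin/coslight2.prometheus

# in the machine model: consume the offer and relate it to grafana-agent
# (prefix with <controller>: if the COS model lives on another controller)
juju consume admin/coslight2.prometheus
juju relate grafana-agent prometheus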

The experiment is so far successful. However, it seems that my charm isn't working, so I break the relation to the offered prometheus. This is where things start to get messy, and I can't account for the exact steps that get me to a situation where my COS Lite model is completely in ERROR.

But in general, I think I removed the relation:

juju remove-relation prometheus grafana-agent


and then I tried to offer the prometheus endpoint like this:

juju offer prometheus:send-remote-write

Somehow, I now can't interact with the COS Lite model (coslight2).

But I can still work with other models as normal.

When I look into the controller (/var/log/logsink.log), it throws out a lot of this:

485b9a87-e232-419a-8139-6829f0940572: machine-0 2023-05-22 22:36:58 INFO juju.state.allwatcher allwatcher.go:1823 allwatcher loaded for model "6cd3ef03-323e-4bf8-8b06-1bd7fed503a5" in 41.018185ms 
485b9a87-e232-419a-8139-6829f0940572: machine-0 2023-05-22 22:36:58 INFO juju.state.allwatcher allwatcher.go:1823 allwatcher loaded for model "485b9a87-e232-419a-8139-6829f0940572" in 9.157426ms 
485b9a87-e232-419a-8139-6829f0940572: machine-0 2023-05-22 22:36:58 ERROR juju.worker.modelcache worker.go:373 watcher error: error loading entities for model 6cd3ef03-323e-4bf8-8b06-1bd7fed503a5: failed to initialise backing for applicationOffers:prometheus: getting relation endpoint for relation "send-remote-write" and application "prometheus": application "prometheus" has no "send-remote-write" relation, getting new watcher 
485b9a87-e232-419a-8139-6829f0940572: machine-0 2023-05-22 22:36:58 INFO juju.state.allwatcher allwatcher.go:1823 allwatcher loaded for model "bd164989-ad1a-40ff-8980-b76b1a192d19" in 24.595887ms 
485b9a87-e232-419a-8139-6829f0940572: machine-0 2023-05-22 22:36:58 INFO juju.state.allwatcher allwatcher.go:1823 allwatcher loaded for model "6cd3ef03-323e-4bf8-8b06-1bd7fed503a5" in 25.40013ms 

I hope this is helpful for figuring out what's going on, and I wonder if it has something to do with how I use the LXD controller to accommodate a MicroK8s cluster intended for COS Lite.

Perhaps we MUST use separate controllers for MicroK8s vs LXD to make all this work (@0x12b)?

@wallyworld

[UPDATE]

I tried just now to use my MicroK8s cloud, as described above, to add a new model and deploy a new COS stack.

But Juju is not happy at all and prevents me from doing this. @wallyworld

@wallyworld I'm kind of blocked from developing this further, and I wonder if I should tear down my environment before you've had a chance to look at the situation on my end?

So, my controller is totally dead and I need to tear it down. This is the last log from the controller restart (it doesn't recover or start).

tail -f /var/log/juju/machine-0.log

2023-05-25 17:52:01 INFO juju.state.allwatcher allwatcher.go:1823 allwatcher loaded for model "f720b50a-0ff1-4264-855b-3969806d415e" in 7.533894ms
2023-05-25 17:52:01 ERROR juju.worker.modelcache worker.go:373 watcher error: error loading entities for model 6cd3ef03-323e-4bf8-8b06-1bd7fed503a5: failed to initialise backing for applicationOffers:prometheus: getting relation endpoint for relation "send-remote-write" and application "prometheus": application "prometheus" has no "send-remote-write" relation, getting new watcher
2023-05-25 17:52:01 INFO juju.state.allwatcher allwatcher.go:1823 allwatcher loaded for model "6cd3ef03-323e-4bf8-8b06-1bd7fed503a5" in 39.683876ms
2023-05-25 17:52:01 INFO juju.state.allwatcher allwatcher.go:1823 allwatcher loaded for model "485b9a87-e232-419a-8139-6829f0940572" in 13.184189ms
2023-05-25 17:52:01 INFO juju.state.allwatcher allwatcher.go:1823 allwatcher loaded for model "f720b50a-0ff1-4264-855b-3969806d415e" in 8.064403ms

cat /var/snap/juju-db/common/logs/mongodb.log.2023-05-25T17-49-16

{"t":{"$date":"2023-05-25T17:49:15.455+00:00"},"s":"I",  "c":"CONTROL",  "id":23285,   "ctx":"main","msg":"Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify --sslDisabledProtocols 'none'"}
{"t":{"$date":"2023-05-25T17:49:15.456+00:00"},"s":"I",  "c":"NETWORK",  "id":4648601, "ctx":"main","msg":"Implicit TCP FastOpen unavailable. If TCP FastOpen is required, set tcpFastOpenServer, tcpFastOpenClient, and tcpFastOpenQueueSize."}
{"t":{"$date":"2023-05-25T17:49:15.528+00:00"},"s":"I",  "c":"STORAGE",  "id":4615611, "ctx":"initandlisten","msg":"MongoDB starting","attr":{"pid":25175,"port":37017,"dbPath":"/var/snap/juju-db/common/db","architecture":"64-bit","host":"juju-940572-0"}}
{"t":{"$date":"2023-05-25T17:49:15.528+00:00"},"s":"I",  "c":"CONTROL",  "id":23403,   "ctx":"initandlisten","msg":"Build Info","attr":{"buildInfo":{"version":"4.4.18","gitVersion":"8ed32b5c2c68ebe7f8ae2ebe8d23f36037a17dea","openSSLVersion":"OpenSSL 1.1.1f  31 Mar 2020","modules":[],"allocator":"tcmalloc","environment":{"distarch":"x86_64","target_arch":"x86_64"}}}}
{"t":{"$date":"2023-05-25T17:49:15.528+00:00"},"s":"I",  "c":"CONTROL",  "id":51765,   "ctx":"initandlisten","msg":"Operating System","attr":{"os":{"name":"Ubuntu","version":"20.04"}}}
{"t":{"$date":"2023-05-25T17:49:15.528+00:00"},"s":"I",  "c":"CONTROL",  "id":21951,   "ctx":"initandlisten","msg":"Options set by command line","attr":{"options":{"config":"/var/snap/juju-db/common/juju-db.config","net":{"bindIp":"*","ipv6":true,"port":37017,"tls":{"certificateKeyFile":"/var/snap/juju-db/common/server.pem","certificateKeyFilePassword":"<password>","mode":"requireTLS"}},"operationProfiling":{"slowOpThresholdMs":1000},"replication":{"oplogSizeMB":1024,"replSet":"juju"},"security":{"authorization":"enabled","keyFile":"/var/snap/juju-db/common/shared-secret"},"storage":{"dbPath":"/var/snap/juju-db/common/db","engine":"wiredTiger","journal":{"enabled":true}},"systemLog":{"destination":"file","path":"/var/snap/juju-db/common/logs/mongodb.log","quiet":true}}}}
{"t":{"$date":"2023-05-25T17:49:15.529+00:00"},"s":"I",  "c":"STORAGE",  "id":22297,   "ctx":"initandlisten","msg":"Using the XFS filesystem is strongly recommended with the WiredTiger storage engine. See http://dochub.mongodb.org/core/prodnotes-filesystem","tags":["startupWarnings"]}
{"t":{"$date":"2023-05-25T17:49:15.530+00:00"},"s":"I",  "c":"STORAGE",  "id":22315,   "ctx":"initandlisten","msg":"Opening WiredTiger","attr":{"config":"create,cache_size=15495M,session_max=33000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000,close_scan_interval=10,close_handle_minimum=250),statistics_log=(wait=0),verbose=[recovery_progress,checkpoint_progress,compact_progress],"}}
{"t":{"$date":"2023-05-25T17:49:15.866+00:00"},"s":"I",  "c":"CONTROL",  "id":23377,   "ctx":"SignalHandler","msg":"Received signal","attr":{"signal":15,"error":"Terminated"}}
{"t":{"$date":"2023-05-25T17:49:15.866+00:00"},"s":"I",  "c":"CONTROL",  "id":23378,   "ctx":"SignalHandler","msg":"Signal was sent by kill(2)","attr":{"pid":1,"uid":0}}
{"t":{"$date":"2023-05-25T17:49:15.866+00:00"},"s":"I",  "c":"CONTROL",  "id":23381,   "ctx":"SignalHandler","msg":"will terminate after current cmd ends"}
{"t":{"$date":"2023-05-25T17:49:15.866+00:00"},"s":"I",  "c":"REPL",     "id":4784900, "ctx":"SignalHandler","msg":"Stepping down the ReplicationCoordinator for shutdown","attr":{"waitTimeMillis":10000}}
{"t":{"$date":"2023-05-25T17:49:15.867+00:00"},"s":"I",  "c":"COMMAND",  "id":4784901, "ctx":"SignalHandler","msg":"Shutting down the MirrorMaestro"}
{"t":{"$date":"2023-05-25T17:49:15.867+00:00"},"s":"I",  "c":"SHARDING", "id":4784902, "ctx":"SignalHandler","msg":"Shutting down the WaitForMajorityService"}
{"t":{"$date":"2023-05-25T17:49:15.867+00:00"},"s":"I",  "c":"NETWORK",  "id":20562,   "ctx":"SignalHandler","msg":"Shutdown: going to close listening sockets"}
{"t":{"$date":"2023-05-25T17:49:15.867+00:00"},"s":"I",  "c":"NETWORK",  "id":4784905, "ctx":"SignalHandler","msg":"Shutting down the global connection pool"}
{"t":{"$date":"2023-05-25T17:49:15.867+00:00"},"s":"I",  "c":"STORAGE",  "id":4784906, "ctx":"SignalHandler","msg":"Shutting down the FlowControlTicketholder"}
{"t":{"$date":"2023-05-25T17:49:15.867+00:00"},"s":"I",  "c":"-",        "id":20520,   "ctx":"SignalHandler","msg":"Stopping further Flow Control ticket acquisitions."}
{"t":{"$date":"2023-05-25T17:49:15.867+00:00"},"s":"I",  "c":"REPL",     "id":4784907, "ctx":"SignalHandler","msg":"Shutting down the replica set node executor"}
{"t":{"$date":"2023-05-25T17:49:15.867+00:00"},"s":"I",  "c":"NETWORK",  "id":4784918, "ctx":"SignalHandler","msg":"Shutting down the ReplicaSetMonitor"}
{"t":{"$date":"2023-05-25T17:49:15.867+00:00"},"s":"I",  "c":"SHARDING", "id":4784921, "ctx":"SignalHandler","msg":"Shutting down the MigrationUtilExecutor"}
{"t":{"$date":"2023-05-25T17:49:15.867+00:00"},"s":"I",  "c":"CONTROL",  "id":4784925, "ctx":"SignalHandler","msg":"Shutting down free monitoring"}
{"t":{"$date":"2023-05-25T17:49:15.867+00:00"},"s":"I",  "c":"STORAGE",  "id":4784927, "ctx":"SignalHandler","msg":"Shutting down the HealthLog"}
{"t":{"$date":"2023-05-25T17:49:15.867+00:00"},"s":"I",  "c":"STORAGE",  "id":4784929, "ctx":"SignalHandler","msg":"Acquiring the global lock for shutdown"}
{"t":{"$date":"2023-05-25T17:49:15.867+00:00"},"s":"I",  "c":"-",        "id":4784931, "ctx":"SignalHandler","msg":"Dropping the scope cache for shutdown"}
{"t":{"$date":"2023-05-25T17:49:15.867+00:00"},"s":"I",  "c":"FTDC",     "id":4784926, "ctx":"SignalHandler","msg":"Shutting down full-time data capture"}
{"t":{"$date":"2023-05-25T17:49:15.867+00:00"},"s":"I",  "c":"CONTROL",  "id":20565,   "ctx":"SignalHandler","msg":"Now exiting"}
{"t":{"$date":"2023-05-25T17:49:15.867+00:00"},"s":"I",  "c":"CONTROL",  "id":23138,   "ctx":"SignalHandler","msg":"Shutting down","attr":{"exitCode":0}}

Sorry for the delayed reply. Separate controllers should not be necessary. Also, that “state changing too quickly” error is a definite bug. We’ll need to reproduce what you’ve done and see if we can hit the same errors and try and figure out what’s wrong.

Hi Erik,

I tried to reproduce this issue today on my machine. The first thing I noticed was that the ‘install’ hook failed for your observed charm, because the cos_agent library couldn’t find the pydantic package. The fix is to add pydantic to your charm’s requirements.txt.
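For reference, appending the dependency is enough (a specific version pin is optional; the unpinned line below is just an example):

# add pydantic to the charm's Python dependencies
echo "pydantic" >> requirements.txt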

After doing that, I re-packed the charm and continued trying to reproduce the bug.

Reproduction steps
# Juju version 2.9.42 installed from snap
# MicroK8s v1.26.4 revision 5219
# LXD version 5.13-8e2d7eb
# OS: Ubuntu 20.04.6 LTS

juju bootstrap microk8s cos
juju add-model cos
curl -L https://raw.githubusercontent.com/canonical/cos-lite-bundle/main/overlays/offers-overlay.yaml -O
juju deploy cos-lite --channel=edge --trust --overlay ./offers-overlay.yaml
# wait for COS Lite to stabilise

juju bootstrap lxd lxd
juju add-cloud microk8s --controller lxd

# clone https://github.com/erik78se/juju-operators-examples
# charmcraft pack the 'observed' charm
juju deploy ./observed.charm
juju deploy grafana-agent --channel edge
juju relate observed:cos-agent grafana-agent
# wait for things to spin up

# consume COS Lite relation offers
juju consume cos:admin/cos.prometheus-receive-remote-write
juju consume cos:admin/cos.loki-logging
juju consume cos:admin/cos.grafana-dashboards
# and relate them to grafana-agent
juju relate grafana-agent prometheus-receive-remote-write
juju relate grafana-agent loki-logging
juju relate grafana-agent grafana-dashboards

At this point, I tried some of the things you suggested about removing/re-adding the prometheus relation, but was not able to get any of the messages you saw. (see EDIT below)

$ juju switch lxd
cos:admin/cos -> lxd:admin/default
$ juju remove-relation grafana-agent prometheus-receive-remote-write
$ juju switch cos
lxd:admin/default -> cos:admin/cos
$ juju offer prometheus:send-remote-write
ERROR cannot add application offer "prometheus": getting relation endpoint for relation "send-remote-write" and application "prometheus": application "prometheus" has no "send-remote-write" relation
$ juju offer prometheus:receive-remote-write
Application "prometheus" endpoints [receive-remote-write] available at "admin/cos.prometheus"
$ juju offer prometheus:send-remote-write
ERROR cannot update application offer "prometheus": getting relation endpoint for relation "send-remote-write" and application "prometheus": application "prometheus" has no "send-remote-write" relation

You say

I can't account for the exact steps that get me to a situation where my COS Lite model is completely in ERROR.

Unfortunately, I think we would need to know the exact steps in order to reproduce your bug, and work out what’s going on.

Based on your original post, this is what I think happened. Somehow, you ran the command

juju offer prometheus:send-remote-write

which should have returned an error, telling you the send-remote-write endpoint doesn’t exist. However, for some reason, the usual sanity checks didn’t run here, and this operation was sent to the Juju controller. This puts the database in a broken state - where Juju believes that a prometheus:send-remote-write offer exists, but the prometheus app doesn’t have any such endpoint.

You can see above that when I tried to do this, the juju offer command returned an error as expected:

$ juju offer prometheus:send-remote-write
ERROR cannot add application offer "prometheus": getting relation endpoint for relation "send-remote-write" and application "prometheus": application "prometheus" has no "send-remote-write" relation

I further suspect this issue has nothing to do with COS Lite at all. This could have happened for any application and relation.

EDIT: after writing this post, I went back to my Juju model and noticed exactly the same error that Erik mentioned. Will investigate further.


@erik-lonroth your issue appears to be an instance of

When you redefine an offer to an invalid endpoint name, this is entered in the DB when it shouldn’t be. This breaks Juju badly as it puts the model DB in an inconsistent state.
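Based on the transcript above, a minimal trigger looks roughly like the following (the second command goes through the "update offer" path because an offer named prometheus already exists):

juju offer prometheus:receive-remote-write   # creates a valid offer named "prometheus"
juju offer prometheus:send-remote-write      # update path; the non-existent endpoint slips into the DB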

The other issue is the “state changing too quickly” error, which I haven’t looked into, but it might be related to this bug:


Thanx! I’ll put it in there.

This is great, and thanx for finding that error despite the not-so-exact instructions. It's scary indeed that my database was all broken after this procedure. Is there any way to repair it? I dare not put this into our production environment if there isn't a way to repair the DB.


Thanx a lot for putting in the effort.

Could very much be. I've run into this bug a few times when I was exploring CMR in the past, and I always thought CMR was somewhat in a beta state because of all the errors I've encountered. Perhaps it's just me and bad luck.

It is repairable using “database surgery”. We’d have to remove the invalid offer from the applicationOffers table in the database. I don’t know the exact command off the top of my head, but we could work this out if needed.
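For what it's worth, a very rough sketch of that surgery on a 2.9 controller (juju-db snap) is below. The connection details and the filter field are assumptions, so verify everything against your own controller, and take a backup, before removing anything:

# log in to the controller machine
juju ssh -m controller 0

# then, on the controller machine: read the agent's mongo password
PW=$(sudo grep statepassword /var/lib/juju/agents/machine-0/agent.conf | cut -d' ' -f2)

# open the controller database (port 37017 and the "juju" db match the mongodb.log above)
sudo juju-db.mongo --ssl --sslAllowInvalidCertificates \
  --authenticationDatabase admin -u machine-0 -p "$PW" localhost:37017/juju

# inside the mongo shell: inspect the offers, then remove only the broken document
#   db.applicationOffers.find().pretty()
#   db.applicationOffers.remove({ "offer-name": "prometheus" })   // "offer-name" is an assumption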

However, we’ll try to fix this bug ASAP, so you don’t need to resort to DB surgery.


Totally. My Juju controller didn't wake up at all after the error. I purged it. Luckily it's just a test/lab on our end, so this is indeed great if it gets fixed. Thanx again.