Need help with a lost agent

Hi, I have a juju environment with a subordinate charm that has a lost agent. I’ve restarted jujud on that machine and also restarted the unit service, but the agent is still lost. I need help recovering it, but I don’t know what to do next. Any advice?
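
For reference, this is roughly how I restarted things on that machine. The exact systemd service names below are an assumption from my own setup (they follow the usual jujud-machine-<id> / jujud-unit-<app>-<n> pattern), so adjust as needed:

# restart the machine agent (machine 2 in my case)
sudo systemctl restart jujud-machine-2

# restart the subordinate's unit agent (telegraf/2 in my case)
sudo systemctl restart jujud-unit-telegraf-2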

Here are some log outputs.

For the machine:

2022-01-09 18:49:48 ERROR juju.worker.dependency engine.go:676 "api-caller" manifold worker returned unexpected error: codec.ReadHeader error: error receiving message: read tcp 10.50.122.5:40150->10.25.2.109:17070: read: connection reset by peer
2022-01-09 18:49:48 ERROR juju.api.watcher watcher.go:96 error trying to stop watcher: codec.ReadHeader error: error receiving message: read tcp 10.50.122.5:40150->10.25.2.109:17070: read: connection reset by peer
... this is repeated
2022-01-09 18:50:14 ERROR juju.worker.dependency engine.go:676 "api-caller" manifold worker returned unexpected error: [73b7b8] "machine-2" cannot open api: unable to connect to API: read tcp 10.50.122.5:36076->10.25.2.110:17070: read: connection reset by peer
2022-01-09 18:50:18 ERROR juju.worker.dependency engine.go:676 "api-caller" manifold worker returned unexpected error: [73b7b8] "machine-2" cannot open api: unable to connect to API: dial tcp 10.25.2.109:17070: connect: connection refused
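
Is it also worth checking raw TCP connectivity from that machine to the controller API endpoints in those errors? I was thinking of a simple probe like this, run on machine 2 (IPs and port copied from the messages above, assuming netcat is installed):

# check whether the controller API port is reachable at all
nc -vz 10.25.2.109 17070
nc -vz 10.25.2.110 17070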

On restart I get the following:

2022-01-10 13:46:29 INFO juju.cmd supercommand.go:56 running jujud [2.9.21 0 8a154b7d629f6d9c0693aba7accf255789996c14 gc go1.14.15]
2022-01-10 13:46:29 DEBUG juju.cmd supercommand.go:57   args: []string{"/var/lib/juju/tools/machine-2/jujud", "machine", "--data-dir", "/var/lib/juju", "--machine-id", "2", "--debug"}
2022-01-10 13:46:29 DEBUG juju.utils gomaxprocs.go:24 setting GOMAXPROCS to 1
2022-01-10 13:46:29 DEBUG juju.agent agent.go:578 read agent config, format "2.0"
2022-01-10 13:46:29 INFO juju.agent.setup agentconf.go:128 setting logging config to "<root>=WARNING;unit=INFO"
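
Since the agent resets logging to WARNING on startup, would raising the model’s logging config help surface more detail after a restart? I was thinking of something like this, run from a client that can still reach the controller (assuming the model-level logging-config is what applies to this agent):

# raise agent and unit logging for the whole model
juju model-config logging-config="<root>=DEBUG;unit=DEBUG"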

In the unit log I don’t see any new lines after restarting, but there are some errors from November. I don’t know if they’re relevant, since I don’t think the agent has been lost for all that time, but here they are:

2021-11-17 21:42:27 ERROR juju.worker.dependency engine.go:676 "leadership-tracker" manifold worker returned unexpected error: error while telegraf/2 waiting for telegraf leadership release: error blocking on leadership release: lease manager stopped
2021-11-17 21:42:27 ERROR juju.worker.dependency engine.go:676 "log-sender" manifold worker returned unexpected error: cannot send log message: tls: use of closed connection
2021-11-17 21:42:28 INFO unit.telegraf/2.juju-log server.go:327 Reactive main running for hook update-status
2021-11-17 21:42:28 ERROR juju.worker.dependency engine.go:676 "api-caller" manifold worker returned unexpected error: api connection broken unexpectedly
2021-11-17 21:42:28 WARNING unit.telegraf/2.update-status logger.go:60 ERROR connection is shut down
2021-11-17 21:42:28 ERROR unit.telegraf/2.juju-log server.go:327 Hook error:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-telegraf-2/.venv/lib/python3.6/site-packages/charms/reactive/__init__.py", line 73, in main
    hookenv._run_atstart()
  File "/var/lib/juju/agents/unit-telegraf-2/.venv/lib/python3.6/site-packages/charmhelpers/core/hookenv.py", line 1312, in _run_atstart
    callback(*args, **kwargs)
  File "lib/charms/layer/basic.py", line 259, in init_config_states
    config = hookenv.config()
  File "/var/lib/juju/agents/unit-telegraf-2/.venv/lib/python3.6/site-packages/charmhelpers/core/hookenv.py", line 436, in config
    subprocess.check_output(config_cmd_line).decode('UTF-8'))
  File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['config-get', '--all', '--format=json']' returned non-zero exit status 1.
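
As far as I can tell, that traceback is just the config-get hook tool failing because the hook couldn’t reach its agent at the time. Would it make sense to retry the same call by hand to see whether the hook tools respond at all? Something like this (2.9 syntax; it will only work if the unit agent is actually reachable):

# run the failing hook tool directly against the unit
juju run --unit telegraf/2 'config-get --all --format=json'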

I don’t know what to do next. I’m thinking of giving up, removing the relation, and then restoring it, though I imagine that removing the relation won’t work since the agent is lost.

I can only speak from my own experience; when this has happened to me (and it is super rare)…

BUT

I usually just back up whatever contents the unit had, then remove and reconnect the unit, and by reconnect I mean re-relate (roughly as sketched below).

If a relation isn’t dropping, you likely just need to run juju remove-unit <unit_name_here/1> --force
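
By re-relate I mean roughly the following sketch; <principal-application> is a placeholder for whatever application your subordinate is attached to, so adjust it to your model (2.9 syntax):

# back up anything you care about on the unit first, then drop the relation
juju remove-relation telegraf <principal-application>

# once the relation (and with it the subordinate unit) is gone, re-add it
juju add-relation telegraf <principal-application>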

It’s not pretty, but when a unit is on a lone machine I just chalk it up to something being too wrong to take a risk on, and I remove the offending node. If it KEEPS occurring, it could be the charm code doing something in a loop that is causing the lost agent (this is something you can check via juju debug-log; see the sketch below).
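
For the debug-log check, something along these lines; I usually pass the entity-tag form to --include, but adjust the targets to your own unit and machine:

# replay the stored logs for just the telegraf/2 unit
juju debug-log --replay --include unit-telegraf-2

# or watch the machine agent's logs instead
juju debug-log --include machine-2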

In the end, remember that you’re coding in clouds where nodes can drop all the time (on AWS, GCP, etc.), so my infra is coded/set up to handle that, and I just rip out the bad unit and reconnect. If you show your juju status --relations output after trying to remove the relation, we could maybe inspect more, but in general I’d back up whatever’s on that node and just have another try.

I can’t use remove-unit because the telegraf unit is a subordinate one.

Is there anything else I can do if the relation isn’t dropping?

Ah, sorry then; I have not yet used subordinate charms myself, but hopefully someone else can shed some light. If it were me and it were possible, I’d just chop down the entire unit, as bad as that sounds, but I’m not really in a production-grade environment myself.

@hmlanigan Does this bear any resemblance to the recent issue you worked on to regain a lost agent?

I do not see the error message that led to LP 1956975 above. However, the bug has some ideas on investigation and potentially on how to restart the agent.