Hi, I have a Juju environment with a subordinate charm whose agent is lost. I’ve restarted jujud on that machine and I’ve also restarted the unit service, but the agent is still lost. I need help recovering it, but I don’t know what to try next. Any advice?
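For reference, this is roughly what I ran on the machine (assuming the standard systemd service names; adjust the machine and unit numbers for your case):

    # restart the machine agent (machine 2 here)
    sudo systemctl restart jujud-machine-2.service

    # restart the unit agent service, if it exists as a separate service
    # (on newer 2.x agents the unit agent runs inside the machine agent)
    sudo systemctl restart jujud-unit-telegraf-2.service

    # confirm the machine agent is actually up
    sudo systemctl status jujud-machine-2.service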
Here are some log outputs.
For the machine:
2022-01-09 18:49:48 ERROR juju.worker.dependency engine.go:676 "api-caller" manifold worker returned unexpected error: codec.ReadHeader error: error receiving message: read tcp 10.50.122.5:40150->10.25.2.109:17070: read: connection reset by peer
2022-01-09 18:49:48 ERROR juju.api.watcher watcher.go:96 error trying to stop watcher: codec.ReadHeader error: error receiving message: read tcp 10.50.122.5:40150->10.25.2.109:17070: read: connection reset by peer
... this is repeated
2022-01-09 18:50:14 ERROR juju.worker.dependency engine.go:676 "api-caller" manifold worker returned unexpected error: [73b7b8] "machine-2" cannot open api: unable to connect to API: read tcp 10.50.122.5:36076->10.25.2.110:17070: read: connection reset by peer
2022-01-09 18:50:18 ERROR juju.worker.dependency engine.go:676 "api-caller" manifold worker returned unexpected error: [73b7b8] "machine-2" cannot open api: unable to connect to API: dial tcp 10.25.2.109:17070: connect: connection refused
On restart I get the following:
2022-01-10 13:46:29 INFO juju.cmd supercommand.go:56 running jujud [2.9.21 0 8a154b7d629f6d9c0693aba7accf255789996c14 gc go1.14.15]
2022-01-10 13:46:29 DEBUG juju.cmd supercommand.go:57 args: []string{"/var/lib/juju/tools/machine-2/jujud", "machine", "--data-dir", "/var/lib/juju", "--machine-id", "2", "--debug"}
2022-01-10 13:46:29 DEBUG juju.utils gomaxprocs.go:24 setting GOMAXPROCS to 1
2022-01-10 13:46:29 DEBUG juju.agent agent.go:578 read agent config, format "2.0"
2022-01-10 13:46:29 INFO juju.agent.setup agentconf.go:128 setting logging config to "<root>=WARNING;unit=INFO"
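From the controller side, I assume the right way to inspect the agent is something like this (standard Juju 2.9 commands; telegraf/2 is the affected unit):

    # overall status of the subordinate and its host machine
    juju status telegraf

    # history of agent/workload state changes for the unit
    juju show-status-log telegraf/2

    # recent log lines for just that unit
    juju debug-log --include unit-telegraf-2 --replay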
In the unit log I don’t see any new lines after restarting, but there are some errors from November. I don’t know if they’re relevant, since I don’t think the agent has been lost for that whole time, but here they are:
2021-11-17 21:42:27 ERROR juju.worker.dependency engine.go:676 "leadership-tracker" manifold worker returned unexpected error: error while telegraf/2 waiting for telegraf leadership release: error blocking on leadership release: lease manager stopped
2021-11-17 21:42:27 ERROR juju.worker.dependency engine.go:676 "log-sender" manifold worker returned unexpected error: cannot send log message: tls: use of closed connection
2021-11-17 21:42:28 INFO unit.telegraf/2.juju-log server.go:327 Reactive main running for hook update-status
2021-11-17 21:42:28 ERROR juju.worker.dependency engine.go:676 "api-caller" manifold worker returned unexpected error: api connection broken unexpectedly
2021-11-17 21:42:28 WARNING unit.telegraf/2.update-status logger.go:60 ERROR connection is shut down
2021-11-17 21:42:28 ERROR unit.telegraf/2.juju-log server.go:327 Hook error:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-telegraf-2/.venv/lib/python3.6/site-packages/charms/reactive/__init__.py", line 73, in main
    hookenv._run_atstart()
  File "/var/lib/juju/agents/unit-telegraf-2/.venv/lib/python3.6/site-packages/charmhelpers/core/hookenv.py", line 1312, in _run_atstart
    callback(*args, **kwargs)
  File "lib/charms/layer/basic.py", line 259, in init_config_states
    config = hookenv.config()
  File "/var/lib/juju/agents/unit-telegraf-2/.venv/lib/python3.6/site-packages/charmhelpers/core/hookenv.py", line 436, in config
    subprocess.check_output(config_cmd_line).decode('UTF-8'))
  File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['config-get', '--all', '--format=json']' returned non-zero exit status 1.
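My reading is that config-get failed back then only because the API connection had already dropped, not because of the charm itself. I suppose I could check whether the hook tools respond now with something like the following, though I expect it to just hang while the agent is lost:

    # run a hook tool in the unit's context (Juju 2.9 syntax)
    juju run --unit telegraf/2 'config-get --all --format=json'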
I don’t know what to do next. I’m thinking of giving up, removing the relation, and then restoring it, though I imagine removing the relation won’t work while the agent is lost.
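If I do go down that route, I assume it would look something like this (the principal application name and endpoint names below are placeholders for my setup; telegraf normally relates over juju-info):

    # drop the subordinate relation; --force in case the lost agent blocks cleanup
    juju remove-relation telegraf:juju-info myapp:juju-info --force

    # and later restore it
    juju add-relation telegraf:juju-info myapp:juju-info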