Pebble failing to connect to container on sidecar charm

While trying to upgrade the Indico charm (our first sidecar charm) we noticed some strange behavior. The refresh operation was carried out on an application consisting of two units. The first unit was upgraded successfully without any kind of error, unlike the second one.

We hit the following exception while refreshing the Indico charm, which prevented one of the containers from starting. After 14 restarts, the deployment succeeded without any manual intervention.

Traceback (most recent call last):
  File "./src/charm.py", line 537, in <module>
    main(IndicoOperatorCharm, use_juju_for_storage=True)
  File "/var/lib/juju/agents/unit-indico-0/charm/venv/ops/main.py", line 429, in main
    framework.reemit()
  File "/var/lib/juju/agents/unit-indico-0/charm/venv/ops/framework.py", line 794, in reemit
    self._reemit()
  File "/var/lib/juju/agents/unit-indico-0/charm/venv/ops/framework.py", line 857, in _reemit
    custom_handler(event)
  File "./src/charm.py", line 456, in _on_config_changed
    self._config_pebble(self.unit.get_container(container_name))
  File "./src/charm.py", line 165, in _config_pebble
    self._install_plugins(container, plugins)
  File "./src/charm.py", line 483, in _install_plugins
    process.wait_output()
  File "/var/lib/juju/agents/unit-indico-0/charm/venv/ops/pebble.py", line 1098, in wait_output
    exit_code = self._wait()
  File "/var/lib/juju/agents/unit-indico-0/charm/venv/ops/pebble.py", line 1044, in _wait
    change = self._client.wait_change(self._change_id, timeout=timeout)
  File "/var/lib/juju/agents/unit-indico-0/charm/venv/ops/pebble.py", line 1521, in wait_change
    return self._wait_change_using_wait(change_id, timeout)
  File "/var/lib/juju/agents/unit-indico-0/charm/venv/ops/pebble.py", line 1542, in _wait_change_using_wait
    return self._wait_change(change_id, this_timeout)
  File "/var/lib/juju/agents/unit-indico-0/charm/venv/ops/pebble.py", line 1557, in _wait_change
    resp = self._request('GET', '/v1/changes/{}/wait'.format(change_id), query)
  File "/var/lib/juju/agents/unit-indico-0/charm/venv/ops/pebble.py", line 1297, in _request
    response = self._request_raw(method, path, query, headers, data)
  File "/var/lib/juju/agents/unit-indico-0/charm/venv/ops/pebble.py", line 1344, in _request_raw
    raise ConnectionError(e.reason)

Apparently, Pebble was not able to connect to the container, even though the code explicitly checks that the connection is possible. The code executed before the exception occurred is pasted below:

self.framework.observe(self.on.indico_pebble_ready, self._on_pebble_ready)

def _on_pebble_ready(self, event):
    """Handle the on pebble ready event for the containers."""
    if not self._are_relations_ready(event) or not event.workload.can_connect():
        event.defer()
        return
    self._config_pebble(event.workload)


def _config_pebble(self, container):
    """Apply pebble changes."""
    self.unit.status = MaintenanceStatus("Adding {} layer to pebble".format(container.name))
    if container.name in ["indico", "indico-celery"]:
        self._set_git_proxy_config(container)
        plugins = (
            self.config["external_plugins"].split(",")
            if self.config["external_plugins"]
            else []
        )
        self._install_plugins(container, plugins)


def _install_plugins(self, container, plugins):
    """Install the external plugins."""
    if plugins:
        process = container.exec(
            ["pip", "install"] + plugins,
            environment=self._get_http_proxy_configuration(),
        )
        process.wait_output()

As can be seen in the snippet above, the exec command installs the plugins. When logging into the container, we were able to confirm that the plugin had actually been installed, even though the container kept crashing.

Does this error actually mean that Pebble is not ready yet? We are wondering whether we're missing some mandatory checks before executing a command in the container, or whether this is abnormal behavior.
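
For illustration, the kind of additional guard we have been considering would look roughly like this (just a sketch; the helper and container names are the ones from the snippets above, and the try/except around the call is hypothetical):

import logging

from ops import pebble

logger = logging.getLogger(__name__)


def _on_config_changed(self, event):
    """Sketch: re-check the Pebble connection and retry the hook on failure."""
    container = self.unit.get_container("indico")
    if not container.can_connect():
        # The Pebble socket is not reachable yet; try again on a later event.
        event.defer()
        return
    try:
        self._config_pebble(container)
    except pebble.ConnectionError as exc:
        # The connection dropped between can_connect() and exec(); retry later.
        logger.warning("Pebble connection failed, deferring: %s", exc)
        event.defer()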

Thanks for the report. It’s hard to say without more detail … it’d be good to see a couple of things:

  1. What is the e.reason argument to the ConnectionError? Normally that's shown in the line directly after the traceback, like so (if it isn't visible in your logs, see the sketch after this list for one way to surface it):
Traceback (most recent call last):
  ...
ops.pebble.ConnectionError: reason shown here
  2. Do we know why K8s is restarting that container 14 times? Is there a restart reason given in the K8s diagnostic info? It's likely either Pebble exiting/crashing, or the K8s readiness/liveness checks failing. If we can see more detail there, that'd help.
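
If the reason isn't showing up in the Juju debug log, one way to surface it would be to log the exception from the charm before re-raising, along these lines (a sketch only, based on the _install_plugins helper in your snippet):

import logging

from ops import pebble

logger = logging.getLogger(__name__)


def _install_plugins(self, container, plugins):
    """Install the external plugins, logging the reason if the Pebble connection fails."""
    if not plugins:
        return
    process = container.exec(
        ["pip", "install"] + plugins,
        environment=self._get_http_proxy_configuration(),
    )
    try:
        process.wait_output()
    except pebble.ConnectionError as exc:
        # str(exc) carries the reason the Pebble client raised ConnectionError with.
        logger.error("pebble exec could not connect: %s", exc)
        raise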

If can_connect is successful, exec should work, so Pebble is ready. I’ve confirmed that locally now.
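
To illustrate, can_connect and exec go through the same Pebble socket, so something along these lines exercises the same code path (run from inside the charm container; the socket path is the usual one a sidecar charm sees for the indico container and may differ in your setup):

from ops import pebble

# Both calls below talk to the same Pebble socket that the charm uses.
client = pebble.Client(socket_path="/charm/containers/indico/pebble.socket")
client.get_system_info()               # roughly what Container.can_connect() does
process = client.exec(["echo", "ok"])  # same socket, so this is expected to work too
stdout, _ = process.wait_output()
print(stdout)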

Is this reproducible? If so, perhaps you can point us to the charm repo and repro steps to make it happen locally?

Hi Ben,

Thank you for your response.

The ConnectionError reason is ops.pebble.ConnectionError: [Errno 2] No such file or directory

The container crashes are directly caused by Pebble not being able to execute the command inside the container: these failures prevent the service from starting, so the Pebble checks fail.

Unfortunately, we haven’t been able to reproduce the error locally, but we’re seeing this issue in two different Kubernetes clusters.

The charm repository is https://github.com/canonical/indico-operator

Let me know if there’s any additional info I can provide.

Thank you

We’re seeing some other issues in the logs as well and, after some discussions with @jameinel, have filed Bug #1989004 “socket.timeout as part of container.exec inside a ...” against Juju.

Just to note that the container restarts all seem to happen fairly early in the lifecycle of the pod, and then things stabilise. We’ve had 10 restarts on the indico/1 unit in a particular model, but the last restart was at Wed, 07 Sep 2022 08:24:55 +0000 (time of writing Wed, 07 Sep 2022 14:32:01 +0000).