Weird connection troubles in LXD containers on AWS

maaudet · 25 August 2020 14:17

Hello,

This is the second time I’ve run into this weird issue. The first time was in a disposable environment (with a more recent version of Juju), so I just recreated it from scratch. This time it’s in a production environment, so I’m not as lucky.

I have 3 EC2 instances created by Juju as machines, with 6 LXD containers (2 per server).

When I connect over TCP to some AWS services (like Elasticsearch or ElastiCache), the connection is established successfully, but I never get any data back from the servers. It started happening after a reboot; everything was working perfectly before that. I do not have this issue with any other machines on the network or online. One of the three machines hasn’t been rebooted yet and is not having the issue, which makes me wonder whether a recent update introduced the problem.

Everything works fine if the connection is established directly from the machine rather than from inside the containers.

Juju version: 2.4.3
Ubuntu version on all machines and containers: 18.04

The only weird thing I’ve noticed is that they all have 30+ veth interfaces, but I fail to see a link, since all 3 machines have them (including the working one). Could I safely delete all the veth interfaces, and would LXD recreate the ones it needs?

Has anyone had a similar experience?

simonrichardson · 26 August 2020 08:20

Do you let the connection time out, or are you killing the connection yourself? And how long do you wait before the timeout/kill?

maaudet · 26 August 2020 13:03

The connection is set to never time out in the Redis configuration. When I run time strace -f -e network netcat {{redacted-hostname}} 6379 <<< INFO, or when I connect using telnet, I get no data back, but the connection is maintained without error for 10+ minutes. The strace results are exactly the same on the working and non-working servers, the only difference being that I actually get output afterwards on the working server.
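For anyone trying to reproduce the symptom without waiting 10+ minutes, here is a minimal Python sketch of the same kind of probe with explicit timeouts. It separates "could not connect" from "connected but never got a byte back", which is the case described above. The function name, the result labels, and the default payload are illustrative assumptions, not something taken from the thread.

```python
import socket


def probe_tcp(host, port, payload=b"INFO\r\n",
              connect_timeout=5.0, reply_timeout=5.0):
    """Connect, send a payload, and classify what happens.

    Returns one of (hypothetical labels):
      "connect_failed" - the TCP connection could not be established
      "no_reply"       - connected, but nothing arrived before reply_timeout
      "closed"         - the peer closed or reset the connection
      "replied"        - at least one byte came back
    """
    try:
        sock = socket.create_connection((host, port), timeout=connect_timeout)
    except OSError:
        return "connect_failed"
    with sock:
        # From here on, the timeout applies to send/recv rather than connect.
        sock.settimeout(reply_timeout)
        try:
            sock.sendall(payload)
            data = sock.recv(4096)
        except socket.timeout:
            return "no_reply"
        except OSError:
            return "closed"
        return "replied" if data else "closed"
```

Running it from inside a container and then from the host against the same endpoint, for example `probe_tcp("<your-endpoint>", 6379)`, should show "no_reply" in one place and "replied" in the other if the behaviour matches what is described here.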

maaudet · 28 August 2020 19:23

When I run lxc config show --expanded juju-c30acc-0-lxd-1 on the host, I get the following devices:

devices:
  eth0:
    hwaddr: 00:16:3e:61:91:69
    mtu: "8951"
    name: eth0
    nictype: bridged
    parent: fan-252
    type: nic
  eth1:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth2:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth3:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth4:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth5:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth6:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth7:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth8:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth9:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth10:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth11:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth12:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth13:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth14:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth15:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth16:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth17:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth18:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth19:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth20:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth21:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth22:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth23:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth24:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth25:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth26:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth27:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth28:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth29:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth30:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth31:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth32:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth33:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth34:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth35:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth36:
    nictype: bridged
    parent: lxdbr0
    type: nic
  eth37:
    nictype: bridged
    parent: lxdbr0
    type: nic

I don’t know where they come from and I don’t know how to remove them.

maaudet · 28 August 2020 19:42

I found them! They are in the default profile for some reason. I wiped all the eth* devices from the default profile, but that didn’t fix my issue.
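For anyone else cleaning up a profile like this, here is a small dry-run sketch that only builds the `lxc profile device remove` commands so they can be reviewed before anything is run. The function name and the `keep` parameter are hypothetical. Whether any device should be kept depends on your setup, so check `lxc profile show default` first.

```python
def removal_commands(device_names, profile="default", keep=()):
    """Build (but do not run) one `lxc profile device remove` per device.

    `keep` is an illustrative safety list of device names to leave alone.
    """
    commands = []
    for name in device_names:
        if name in keep:
            continue
        commands.append(["lxc", "profile", "device", "remove", profile, name])
    return commands


# Example: the 37 extra NICs seen in the expanded config (eth1..eth37).
extra_nics = [f"eth{i}" for i in range(1, 38)]
planned = removal_commands(extra_nics)
```

Each entry in `planned` is an argument list that could be passed to `subprocess.run` once you are happy with the plan.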

What did fix it: we had to change the MTU of the fan-252 device to 8951 on the host machine; otherwise we’d still have issues. The working machine’s MTU was already set to 8951, while it was set to 1450 on both non-working hosts.
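One plausible reading (not confirmed in this thread) is that the containers’ eth0 was configured with MTU 8951 while the fan-252 bridge underneath was at 1450, so small packets such as the TCP handshake got through while larger reply packets did not. A quick way to spot that kind of drift on a Linux host is to compare the expected values with what the kernel reports under /sys/class/net. The sketch below does that; the function names are illustrative and the expected values are whatever you pass in.

```python
from pathlib import Path


def read_mtu(iface, sysfs_root="/sys/class/net"):
    """Read the current MTU of a network interface from sysfs."""
    return int((Path(sysfs_root) / iface / "mtu").read_text().strip())


def mtu_mismatches(expected, sysfs_root="/sys/class/net"):
    """Return {iface: (expected, actual)} for interfaces whose MTU differs.

    `actual` is None when the interface does not exist or cannot be read.
    """
    mismatches = {}
    for iface, want in expected.items():
        try:
            actual = read_mtu(iface, sysfs_root)
        except (OSError, ValueError):
            actual = None
        if actual != want:
            mismatches[iface] = (want, actual)
    return mismatches
```

On the hosts, `mtu_mismatches({"fan-252": 8951})` would return an empty dict on the working machine and `{"fan-252": (8951, 1450)}` on a host still at 1450.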

Now we’re looking for a way to enforce this setting, and we’re not sure where to look.