AWS security group cleanup

Hello,

I would like to add some heat, if possible, to the AWS security group cleanup issue that has been plaguing us for years now.

With a quick glance I can find multiple bugs for this issue stretching back to the early Juju 2.x days. I’m hoping to bring this bug out front before it sneaks its way into 3.x.

Here are three bugs that all seem to be for this same issue:

The issue manifests as failed deployments with a user-facing error that indicates a security group quota has been hit.

juju status shows:

Model            Controller  Cloud/Region   Version  SLA          Timestamp
02a3164-centos7  osl-aws     aws/us-west-2  2.8.6    unsupported  12:09:28-07:00

App                 Version  Status   Scale  Charm               Store       Rev  OS      Notes
percona-cluster              waiting    0/1  percona-cluster     jujucharms  293  ubuntu
slurm-configurator           waiting    0/1  slurm-configurator  local         0  centos
slurmctld                    waiting    0/1  slurmctld           local         0  centos
slurmd                       waiting    0/1  slurmd              local         0  centos
slurmdbd                     waiting    0/1  slurmdbd            local         0  centos
slurmrestd                   waiting    0/1  slurmrestd          local         0  centos

Unit                  Workload  Agent       Machine  Public address  Ports  Message
percona-cluster/0     waiting   allocating  0                               waiting for machine
slurm-configurator/0  waiting   allocating  1                               waiting for machine
slurmctld/0           waiting   allocating  2                               waiting for machine
slurmd/0              waiting   allocating  3                               waiting for machine
slurmdbd/0            waiting   allocating  4                               waiting for machine
slurmrestd/0          waiting   allocating  5                               waiting for machine

Machine  State    DNS  Inst id  Series   AZ  Message
0        pending       pending  bionic       failed to start machine 0 (cannot set up groups: creating security group "juju-dad7c9ed-e8a5-49e9-82ce-fe6cd7a3043c": The maximum number of security groups has been reached. (SecurityGroupLimitExceeded)), retrying in 10s (5 more attempts)
1        pending       pending  centos7
2        pending       pending  centos7
3        pending       pending  centos7
4        pending       pending  centos7
5        pending       pending  centos7

Looking at the AWS console, I see thousands of security groups that Juju has created but not cleaned up.

Help!

We’ve spotted this in CI tests from time to time…
From a quick glance at the code, I would first check this piece to see whether these filters are correct… https://github.com/juju/juju/blob/e7c75df8973628b3359bb1640504e00d828d219e/provider/ec2/environ.go#L2008
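
For reference, the query at that spot amounts to a DescribeInstances call restricted by instance state (as described further down this thread). Here’s a minimal standalone sketch of the same query written against aws-sdk-go-v2 rather than Juju’s own EC2 client - the instance ID is a placeholder - which makes the failure mode easy to see:

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := ec2.NewFromConfig(cfg)

	// Same shape as the Juju query: look the instance up, but only if it
	// has already reached one of the two post-termination states.
	out, err := client.DescribeInstances(ctx, &ec2.DescribeInstancesInput{
		InstanceIds: []string{"i-0123456789abcdef0"}, // placeholder ID
		Filters: []types.Filter{{
			Name:   aws.String("instance-state-name"),
			Values: []string{"shutting-down", "terminated"},
		}},
	})
	if err != nil {
		log.Fatal(err)
	}
	// If the instance is still in "running"/"stopping"/"stopped", the
	// filter drops it and the result is simply empty - no error.
	for _, r := range out.Reservations {
		for _, inst := range r.Instances {
			for _, g := range inst.SecurityGroups {
				fmt.Println(aws.ToString(inst.InstanceId), inst.State.Name, aws.ToString(g.GroupId))
			}
		}
	}
}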

@simonrichardson see instance lifecycle - it looks like the states are correct.

Possibly the instance is in the stopped state when Juju then attempts to delete the security groups, which is why they don’t get picked up and deleted: the instance is not yet in one of the two expected states, terminated or shutting-down.

Per the issue in the comments a few lines below what you have linked - it may be worth revisiting how security groups are cleaned up now that security groups support tagging.
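
To make the tagging idea concrete, here’s a rough sketch (aws-sdk-go-v2, not Juju’s code) of what a tag-driven sweep could look like. The juju-model-uuid tag on groups is an assumption for illustration - Juju tags instances this way, but groups carrying it is the proposal here, not current behaviour:

package cleanup

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

// sweepGroupsByTag deletes every security group carrying a given model
// tag. Sketch only: it assumes groups are tagged at creation with a
// "juju-model-uuid" tag, which is the suggestion above, not something
// Juju does today.
func sweepGroupsByTag(ctx context.Context, client *ec2.Client, modelUUID string) error {
	out, err := client.DescribeSecurityGroups(ctx, &ec2.DescribeSecurityGroupsInput{
		Filters: []types.Filter{{
			Name:   aws.String("tag:juju-model-uuid"),
			Values: []string{modelUUID},
		}},
	})
	if err != nil {
		return err
	}
	for _, sg := range out.SecurityGroups {
		// Groups still attached to running instances fail with
		// DependencyViolation; log them and catch them on the next sweep.
		if _, err := client.DeleteSecurityGroup(ctx, &ec2.DeleteSecurityGroupInput{
			GroupId: sg.GroupId,
		}); err != nil {
			log.Printf("could not delete %s: %v", aws.ToString(sg.GroupId), err)
		}
	}
	return nil
}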

Thoughts?

I’ve marked two of the bugs as duplicates (essentially the same issue) and added the bug to the 2.9.1 milestone - we’re hoping to go to the final 2.9 RC ASAP, so it’s unlikely we’ll get a fix in 2.9.0. Hopefully we can get something done for the .1 release.

I’ve left some comments on Launchpad bug #1720571. There’s specific code to delete a machine’s security group when the instance is terminated, and also the model’s security group when the model is destroyed. That said, I was unable to reproduce the issue - I could remove machines and destroy models, and verified that the security groups were also deleted.

It’s hard to tell from the screenshot what the security groups are - model or machine. The machine ones have the machine id appended at the end (e.g. juju-<model-uuid>-0); the UUID in the sg name is the model UUID.
Are those groups from recent models/machines? Or are they really stale ones from earlier?

@wallyworld +1 to doing a final sweep on the security groups when the model is deleted.

These all appear to be machine SGs.

Let’s look at a model that we have been beating on over the last few days:

$ juju status
Model           Controller  Cloud/Region   Version  SLA          Timestamp
heitor-testing  osl-aws     aws/us-west-2  2.8.6    unsupported  06:05:51-07:00

App                 Version  Status  Scale  Charm               Store       Rev  OS      Notes
percona-cluster     5.7.20   active      1  percona-cluster     jujucharms  293  ubuntu
slurm-configurator  20.11.3  active      1  slurm-configurator  local        10  ubuntu
slurmctld           20.11.3  active      1  slurmctld           local         8  ubuntu
slurmd              20.11.3  active      2  slurmd              local         7  ubuntu
slurmdbd            20.11.3  active      1  slurmdbd            local         7  ubuntu
slurmrestd          20.11.3  active      1  slurmrestd          local         7  ubuntu  exposed

Unit                   Workload  Agent  Machine  Public address  Ports     Message
percona-cluster/6*     active    idle   131      172.31.80.140   3306/tcp  Unit is ready
slurm-configurator/9*  active    idle   132      172.31.81.80              slurm-configurator available
slurmctld/7*           active    idle   133      172.31.82.149             slurmctld available
slurmd/76*             active    idle   134      172.31.80.8               slurmd available
slurmd/77              active    idle   137      172.31.81.18              slurmd available
slurmdbd/7*            active    idle   135      172.31.83.34              slurmdbd available
slurmrestd/7*          active    idle   136      172.31.81.187   6820/tcp  slurmrestd available

Machine  State    DNS            Inst id              Series  AZ          Message
131      started  172.31.80.140  i-00fbb042f69cf3f0a  bionic  us-west-2a  running
132      started  172.31.81.80   i-0ee87007c9a20e401  focal   us-west-2b  running
133      started  172.31.82.149  i-0106ec501b722dc2f  focal   us-west-2c  running
134      started  172.31.80.8    i-067c8cda50c8b4596  focal   us-west-2a  running
135      started  172.31.83.34   i-0d45f4afa74ab64f1  focal   us-west-2d  running
136      started  172.31.81.187  i-04820c7827657bc62  focal   us-west-2b  running
137      started  172.31.81.18   i-01ab82d80b9b5ba4b  focal   us-west-2b  running

Get the model UUID for this model:

 $ cat ~/.local/share/juju/models.yaml | grep heitor-testing -A 1
      admin/heitor-testing:
        uuid: af9522c1-fc3c-41aa-8ad5-0a8229d272ff

Looking in the AWS console for security groups containing the model UUID, I see 94 machine SGs that were created under this model.
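
For a reproducible version of that count (rather than eyeballing the console), EC2 describe filters accept * wildcards, so the machine groups for this model can be listed by name. A small aws-sdk-go-v2 sketch, using the UUID from above:

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := ec2.NewFromConfig(cfg)

	out, err := client.DescribeSecurityGroups(ctx, &ec2.DescribeSecurityGroupsInput{
		Filters: []types.Filter{{
			Name: aws.String("group-name"),
			// Machine groups are named juju-<model-uuid>-<machine-id>;
			// the trailing -* excludes the model-wide group itself.
			Values: []string{"juju-af9522c1-fc3c-41aa-8ad5-0a8229d272ff-*"},
		}},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%d machine security groups left behind\n", len(out.SecurityGroups))
	for _, sg := range out.SecurityGroups {
		fmt.Println(aws.ToString(sg.GroupName))
	}
}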

The latest machine in the model is 137, so possibly this is an indicator that some of the SGs are getting cleaned up.

I can bootstrap a new controller and try this experiment with a fresh model at some point today or tomorrow. I’ll leave feedback here.

Thank you!

Using a 2.9-rc6 controller/model, I get leftover security groups for every machine.

When an instance is deleted, the last thing Juju does is delete the associated security group.
The only thing I can think of that would prevent this is if the instance does not immediately transition to the “shutting-down” state when it is terminated (it is supposed to, IIANM).
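
If that delay really is the cause, one defensive option would be to wait on the state transition before querying. A sketch using the SDK’s built-in waiter (aws-sdk-go-v2; illustrative only, not what Juju’s provisioner currently does):

package cleanup

import (
	"context"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
)

// waitThenCollectGroups blocks until EC2 reports the instance as
// terminated, then returns its security group IDs. Sketch only.
func waitThenCollectGroups(ctx context.Context, client *ec2.Client, instanceID string) ([]string, error) {
	input := &ec2.DescribeInstancesInput{InstanceIds: []string{instanceID}}
	waiter := ec2.NewInstanceTerminatedWaiter(client)
	if err := waiter.Wait(ctx, input, 5*time.Minute); err != nil {
		return nil, err // the instance never reached "terminated"
	}
	// Terminated instances stay visible for a while and still report
	// their security groups.
	out, err := client.DescribeInstances(ctx, input)
	if err != nil {
		return nil, err
	}
	var ids []string
	for _, r := range out.Reservations {
		for _, inst := range r.Instances {
			for _, g := range inst.SecurityGroups {
				ids = append(ids, aws.ToString(g.GroupId))
			}
		}
	}
	return ids, nil
}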

In the 2.8 and 2.9 edge snaps, there’s a warning logged if the instance state is not as expected after completing the request to terminate. So it would be good to know if you see such warnings. Or maybe there’s an error in your logs about not being able to delete the security group.

I haven’t been able to reproduce the issue, so any extra logging to help identify the cause would be good to have.

Here are the trace logs for the model during application removal.

The command that was run:

juju remove-application percona-cluster \
                        slurm-configurator \
                        slurmctld \
                        slurmd \
                        slurmdbd \
                        slurmrestd

controller logs

The code which does the instance termination in the cloud runs in the controller model. The controller logs as attached don’t show much, so getting DEBUG or TRACE logging there would be good.
Was this done with the latest 2.8 or 2.9 edge snap? If so, there’s no WARNING about a terminated instance having an unexpected state, so there’s no obvious reason why the security group would not have been removed.
Are the machine security groups modified in any way after Juju creates them?

Oops, I see.

Here are the DEBUG controller logs for removing an application, “ubuntu”, with a single unit.

The security group that is leftover from the removed ubuntu application (in the controller log):

The security groups that get created don’t seem to have any rules other than allow all outbound.

I think I was mistaken about which logs were best to have. I can see in the model logs:

controller-0: 06:53:48 INFO juju.worker.provisioner stopping known instances [i-0b7d25752d887d052 i-0eb35568c7d2049de i-02ad7604c89496314]
controller-0: 06:53:49 INFO juju.worker.provisioner removing dead machine "30"
controller-0: 06:53:49 INFO juju.worker.provisioner removing dead machine "28"
controller-0: 06:53:49 INFO juju.worker.provisioner removing dead machine "32"
controller-0: 06:53:49 INFO juju.worker.provisioner removing dead machine "29"

That stopping known instances line happens when “terminate” is called via the EC2 API. Juju will then ask EC2 for the security groups belonging to these terminated instances and will delete the groups. There will be a DEBUG line like this:

instance "i-0b7d25752d887d052" has security groups [<blah>]

No such line is emitted, which means that no instance security groups were found. The query uses the “DescribeInstances” API call to get the instances and then reads the security groups attribute, filtering on instances which are in state “shutting-down” or “terminated”. Because “terminate” was called previously, the instance should be at least “shutting-down”. The 2.8 and 2.9 edge snaps emit a WARNING if the instance is not in one of those states… this is the only way I can see that we would fail to get the security groups to remove.
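
Spelled out as a standalone sketch (aws-sdk-go-v2, not the actual Juju source), the flow and the WARNING branch look roughly like this:

package cleanup

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

// collectDeadInstanceGroups mirrors the flow described above: after
// TerminateInstances has been called, look the instances up again and
// collect group IDs only from those already shutting-down/terminated,
// warning about any instance in another state (which the state filter
// would otherwise drop silently). Sketch only.
func collectDeadInstanceGroups(ctx context.Context, client *ec2.Client, ids []string) []string {
	out, err := client.DescribeInstances(ctx, &ec2.DescribeInstancesInput{InstanceIds: ids})
	if err != nil {
		log.Printf("describe instances: %v", err)
		return nil
	}
	var groups []string
	for _, r := range out.Reservations {
		for _, inst := range r.Instances {
			switch inst.State.Name {
			case types.InstanceStateNameShuttingDown, types.InstanceStateNameTerminated:
				for _, g := range inst.SecurityGroups {
					groups = append(groups, aws.ToString(g.GroupId))
				}
			default:
				// This branch corresponds to the edge-snap WARNING.
				log.Printf("WARNING: instance %s in unexpected state %s",
					aws.ToString(inst.InstanceId), inst.State.Name)
			}
		}
	}
	return groups
}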

I can see no such warning in the logs. Can you verify that the latest 2.8 / 2.9 edge snap was used to test, or try again with the 2.8 / 2.9 edge snap? Because I can’t reproduce, we’ll need that extra logging to try to pinpoint why the query for the security groups to delete is failing.

@wallyworld I’ve been using 2.9-rc6. What logs do you need? E.g. <juju>=TRACE on the controller model?

juju 2.9-rc6 will log a WARNING if the terminating instance is not in the expected state (either “shutting-down” or “terminated”) when the security groups are queried. This is the only reason I can see offhand why removing the security group would be skipped. You don’t need TRACE; the WARNING should just be logged in the model logs.

If there’s nothing there, the instance security group should be found and removed; if the removal fails, the logs should contain an error.
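
For what it’s worth, the most informative thing that error can carry is the EC2 error code - DependencyViolation in particular means the group is still referenced by an instance or another group’s rules. A hedged sketch of surfacing it (aws-sdk-go-v2 with smithy-go; again an illustration, not Juju’s code):

package cleanup

import (
	"context"
	"errors"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/smithy-go"
)

// deleteGroupLogged deletes one security group and logs why deletion
// failed, distinguishing "still in use" from everything else.
func deleteGroupLogged(ctx context.Context, client *ec2.Client, groupID string) {
	_, err := client.DeleteSecurityGroup(ctx, &ec2.DeleteSecurityGroupInput{
		GroupId: aws.String(groupID),
	})
	if err == nil {
		return
	}
	var apiErr smithy.APIError
	if errors.As(err, &apiErr) && apiErr.ErrorCode() == "DependencyViolation" {
		// The group is still attached to an ENI or referenced by another
		// group's rules; it can only be deleted after that goes away.
		log.Printf("group %s still in use: %v", groupID, err)
		return
	}
	log.Printf("deleting group %s: %v", groupID, err)
}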

I’ll try again to reproduce, but so far no luck getting it to misbehave. Every time I try, both the instance and the security group are deleted as expected.