So we do support removing a key from relation data using del, and I think we support deleting it by setting it to the empty string. That isn’t the issue that @jamesbeedy is having, though.
I’m a little confused from the pastebin, because in that pastebin, both slurmd/1 and slurmd/2 are considered active, so I would expect slurmd/1 to see slurmd/2’s data.
From the original post, I thought the issue was just that the Operator Framework doesn’t properly remove related units from a relation as a result of something like ‘relation-departed’.
Isn’t this relation-departed (a hook should occur with the removed unit no longer existing)?
It is entirely reasonable to suspect a bug in the Operator Framework, where we might be caching the relation data for a given unit, so that iterating over relation.units and/or reading relation.data[dead_unit] returns data that is no longer valid.
I’d be a bit surprised to have relation.units returning invalid information. AIUI it just iterates over Juju’s relation-list to figure out what units are part of the relation. And according to the rest of the discussion, that seems to be giving the correct values.
I don’t think we intend to support del relation.data[unit] because that would be removing the unit from the relation entirely, but the unit still exists, and the relation still exists. That is very different from del relation.data[unit][key], which just deletes a single key (which I thought we already supported).
We could have del relation.data[unit] indicate that it should set everything to empty, but I think you can already do that with relation.data[unit] = {}. And I find the latter much more obvious as to what would be going on behind the scenes (if we don’t support that yet, I would be happy to have it added).
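To make the distinction concrete, here is a small sketch of the semantics I have in mind for a single unit's data bag. UnitDataBag is a hypothetical stand-in written for this post, not the framework's own class, and the "empty string removes the key" rule is the behaviour I believe we have rather than something I have re-verified here.

```python
from collections.abc import MutableMapping


class UnitDataBag(MutableMapping):
    """Hypothetical stand-in for one unit's relation data bag."""

    def __init__(self, initial=None):
        self._data = dict(initial or {})

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        if not isinstance(value, str):
            raise TypeError("relation data values must be strings")
        if value == "":
            # Assumed semantics: setting the empty string removes the key.
            self._data.pop(key, None)
        else:
            self._data[key] = value

    def __delitem__(self, key):
        # Deleting a key is expressed as setting it to the empty string.
        self[key] = ""

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


bag = UnitDataBag({"ingress-address": "10.5.24.63", "foo": "bar", "blah": "blah"})
del bag["foo"]      # remove a single key
bag["blah"] = ""    # same effect as del
remaining_after_delete = dict(bag)

bag.clear()         # empty every key; the bag (and the unit) still exist
```

The point is that both operations act on keys inside the bag. Neither removes the bag itself, which is why del relation.data[unit] would be a different kind of operation.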
Just to confirm, I poked around my local test juju controller.
$ juju status --relations
Model    Controller  Cloud/Region  Version  SLA          Timestamp
default  lxd         lxd/default   2.8.1    unsupported  16:51:40+04:00

App  Version  Status  Scale  Charm            Store  Rev  OS      Notes
uo   18.04    active      2  ubuntu-operator  local    1  ubuntu
uoo  18.04    active      2  ubuntu-operator  local    2  ubuntu

Unit    Workload  Agent  Machine  Public address  Ports  Message
uo/2*   active    idle   3        10.5.24.63             load: 0.08 0.15 0.18
uo/3    active    idle   4        10.5.24.226            load: 0.06 0.14 0.18
uoo/0*  active    idle   3        10.5.24.63             load: 0.07 0.14 0.18
uoo/1   active    idle   4        10.5.24.226            load: 0.19 0.20 0.19

Machine  State    DNS          Inst id        Series  AZ  Message
3        started  10.5.24.63   juju-b676a4-3  bionic      Running
4        started  10.5.24.226  juju-b676a4-4  bionic      Running

Relation provider  Requirer     Interface    Type     Message
uo:peer            uo:peer      ubuntu-peer  peer
uo:ubuntu          uoo:other-u  ubuntu       regular
uoo:peer           uoo:peer     ubuntu-peer  peer
Now we just have uo/2 and uo/3.
$ juju run --unit uo/2 'relation-ids peer'
peer:0
$ juju run --unit uo/2 'relation-list -r peer:0'
uo/3
$ juju run --unit uo/2 'relation-get -r peer:0 - uo/2'
egress-subnets: 10.5.24.63/32
ingress-address: 10.5.24.63
private-address: 10.5.24.63
$ juju run --unit uo/2 'relation-get -r peer:0 - uo/3'
egress-subnets: 10.5.24.226/32
ingress-address: 10.5.24.226
private-address: 10.5.24.226
$ juju run --unit uo/2 'relation-get -r peer:0 - uo/1'
ERROR cannot read settings for unit "uo/1" in relation "uo:peer": unit "uo/1": settings not found
So the unit can see its own data in the peer relation and can see the related unit, but not the unit which has been removed. Just to confirm:
$ juju add-unit uo --to 3
$ juju run --unit uo/2 'relation-list -r peer:0'
uo/3
uo/4
$ juju run --unit uo/4 'relation-set -r peer:0 foo=bar'
$ juju run --unit uo/2 'relation-get -r peer:0 - uo/4'
egress-subnets: 10.5.24.63/32
foo: bar
ingress-address: 10.5.24.63
private-address: 10.5.24.63
$ juju remove-unit uo/4
removing unit uo/4
$ juju run --unit uo/2 'relation-list -r peer:0'
uo/3
$ juju run --unit uo/2 'relation-get -r peer:0 - uo/4'
egress-subnets: 10.5.24.63/32
foo: bar
ingress-address: 10.5.24.63
private-address: 10.5.24.63
So this says there is a Juju bug, where deleting a unit does remove it from the relation (relation-list doesn’t show it anymore), but relation-get is still able to access that unit’s old relation data.
That might be a cache coherency issue in the unit agent. Let me try restarting it and see if it can still see the data:
$ juju run --unit uo/2 'systemctl restart jujud-unit-uo-2'
action terminated
$ juju run --unit uo/2 'relation-get -r peer:0 - uo/4'
egress-subnets: 10.5.24.63/32
foo: bar
ingress-address: 10.5.24.63
private-address: 10.5.24.63
And if we check in the database:
juju:PRIMARY> db.settings.find({"_id": {"$regex": /.*r#0#.*/}}).pretty()
...
{
"_id" : "1a914b3f-a6ed-4bd4-8992-733923b676a4:r#0#peer#uo/4",
"model-uuid" : "1a914b3f-a6ed-4bd4-8992-733923b676a4",
"settings" : {
"egress-subnets" : "10.5.24.63/32",
"private-address" : "10.5.24.63",
"ingress-address" : "10.5.24.63",
"foo" : "bar"
},
"version" : NumberLong(1),
"txn-revno" : NumberLong(3),
"txn-queue" : [
"5f3143c0d07ccc0830cead0d_8a28b79e"
]
}
...
There is the relation data for the uo/4 peer in relation ‘0’.
So Juju isn’t deleting the relation data for units when they are removed from a relation.
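The check I did by hand above can be written down as a small function: take the units that relation-list reports, try reading settings for some candidate units, and flag any unit whose settings are readable even though it is not listed. This is only a sketch. find_stale_units and fake_relation_get are names I made up, and the lookup is faked with a dict holding the values from the transcript rather than calling the real hook tools.

```python
def find_stale_units(listed_units, candidate_units, get_settings):
    """Return {unit: settings} for units that are readable but not listed.

    get_settings is any callable that returns a unit's settings and raises
    LookupError when the settings cannot be found.
    """
    listed = set(listed_units)
    stale = {}
    for unit in candidate_units:
        if unit in listed:
            continue  # still part of the relation, so its data is expected
        try:
            settings = get_settings(unit)
        except LookupError:
            continue  # the "settings not found" case, which is what we want
        stale[unit] = settings
    return stale


# Values copied from the transcript above; uo/1 has no settings at all.
fake_settings = {
    "uo/3": {"ingress-address": "10.5.24.226"},
    "uo/4": {"ingress-address": "10.5.24.63", "foo": "bar"},
}


def fake_relation_get(unit):
    # KeyError is a LookupError, standing in for "settings not found".
    return fake_settings[unit]


stale = find_stale_units(["uo/3"], ["uo/1", "uo/3", "uo/4"], fake_relation_get)
```

With the data from my controller, only uo/4 ends up in stale: uo/1 fails the lookup and uo/3 is still listed.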
Note that this isn’t specific to peer relations. I tried it with 2 applications over a normal provides/requires relation, and even though relation-list says the unit isn’t part of the relation, relation-get will return its old data:
$ juju run --unit uo/2 'relation-list -r 2'
uoo/0
uoo/1
$ juju run --unit uo/2 'relation-get -r 2 - uoo/2'
blah: blah
egress-subnets: 10.5.24.63/32
ingress-address: 10.5.24.63
private-address: 10.5.24.63
Anyway, in the Operator Framework, relation.units should have the correct set of units that are currently part of the relation (though potentially during ‘relation-departed’ while we tell you that unit/4 is gone, we might still report it as present in relation-list).
relation-changed always signals a change to one particular data bag, so it doesn’t quite make sense to trigger relation-changed in response to a unit going away. It does seem, though, that we should ensure you can’t read relation data for a unit that no longer exists.
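Until something like that exists, a charm can guard against stale data itself by only reading bags for units that are currently in relation.units. A minimal sketch follows, assuming only that the relation object exposes a units collection and a data mapping keyed by unit. read_unit_data and live_relation_data are hypothetical helpers, not framework API, and the example uses plain strings and a SimpleNamespace in place of real framework objects.

```python
from types import SimpleNamespace


def read_unit_data(relation, unit):
    """Return a copy of a unit's data, refusing units that have departed."""
    if unit not in relation.units:
        raise KeyError(f"{unit} is not currently part of the relation")
    return dict(relation.data[unit])


def live_relation_data(relation):
    """Collect data only for units that the relation currently lists."""
    return {unit: dict(relation.data[unit]) for unit in relation.units}


# Stand-in relation: uo/4 has departed but its old data is still readable.
relation = SimpleNamespace(
    units={"uo/3"},
    data={
        "uo/3": {"ingress-address": "10.5.24.226"},
        "uo/4": {"ingress-address": "10.5.24.63", "foo": "bar"},
    },
)

live = live_relation_data(relation)
```

Here live contains only uo/3, and read_unit_data(relation, "uo/4") raises KeyError instead of handing back the leftover foo: bar.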