OK, I just completely rebuilt the system after this: wiped everything and bootstrapped a controller on fresh, empty hardware.
Four days on, with only one deployment of Charmed Kubernetes done in a pretty standard way… and good news: the controller’s disk didn’t fill up.
However, the controller is now virtually unresponsive. There isn’t much activity on the processor (according to htop), but it struggles to do anything: no Juju GUI, no response to the CLI, can’t nslookup or apt install, but can ping. Even sudo shutdown -r now wouldn’t work. So I restarted the server with the handy MAAS IPMI control. After the restart, apt install and nslookup work again. The mongod service starts up and begins using a decent amount of CPU and a tiny bit of disk I/O.
sudo systemctl list-unit-files results in:
UNIT FILE                    STATE   VENDOR PRESET
juju-clean-shutdown.service  enabled enabled
juju-db.service              enabled enabled
jujud-machine-0.service      enabled enabled
sudo service jujud-machine-0 status gives:
● jujud-machine-0.service - juju agent for machine-0
Loaded: loaded (/etc/systemd/system/jujud-machine-0.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2020-05-14 10:31:55 BST; 1h 10min ago
Main PID: 779 (bash)
Tasks: 12 (limit: 9374)
Memory: 102.8M
sudo service juju-db status gives:
● juju-db.service - juju state database
Loaded: loaded (/etc/systemd/system/juju-db.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2020-05-14 11:45:07 BST; 1s ago
Main PID: 23566 (mongod)
Tasks: 3 (limit: 9374)
Memory: 36.0M
sudo service juju-clean-shutdown status gives:
● juju-clean-shutdown.service - Stop all network interfaces on shutdown
Loaded: loaded (/etc/systemd/system/juju-clean-shutdown.service; enabled; vendor preset: enabled)
Active: inactive (dead)
But there’s still no response whatsoever from juju.
So I checked the logs with journalctl -b and found repeated WiredTiger panics:
read checksum error for 4096B block at offset 65536: calculated block checksum of 1624741532 doesn't match expected checksum of 1174969535
WT_SESSION.open_cursor: the process must exit and restart: WT_PANIC: WiredTiger library panic
May 14 10:32:22 pleach.tombull.com mongod.37017[769]: [initandlisten] Fatal Assertion 28558 at src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp 366
May 14 10:32:22 pleach.tombull.com mongod.37017[769]: [initandlisten]
***aborting after fassert() failure
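For anyone wanting to see how often this happens, here is a minimal sketch that counts the panic lines in journal output. The function name count_wt_panics is made up for illustration; it only looks for the WT_PANIC marker seen in the excerpt above and makes no claim about other failure modes.

```shell
# count_wt_panics (hypothetical helper): read log text on stdin and print
# the number of lines that contain the WiredTiger panic marker.
count_wt_panics() {
  # grep -c prints the count; "|| true" hides grep's non-zero exit
  # status when the count is 0.
  grep -c 'WT_PANIC' || true
}
```

Something like sudo journalctl -b | count_wt_panics would then print the number of panic lines logged since the last boot.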
It seems the juju-db service was just repeatedly restarting mongod and ignoring the error.
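One rough way to check the restart-loop suspicion is to count how many different mongod PIDs show up in the journal for the current boot. This is only a sketch: count_mongod_pids is a hypothetical helper, and it assumes the log tag looks like mongod.37017[769] as in the lines above.

```shell
# count_mongod_pids (hypothetical helper): read journal text on stdin and
# print how many distinct mongod PIDs appear in tags such as
# "mongod.37017[769]:".
count_mongod_pids() {
  grep -o 'mongod\.[0-9][0-9]*\[[0-9][0-9]*\]' | sort -u | wc -l | tr -d ' '
}
```

Running sudo journalctl -b | count_mongod_pids and getting a number well above 1 would be consistent with mongod being started over and over.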
So I ran sudo service juju-db stop, followed by sudo mongod --dbpath /var/lib/juju/db --repair, and then sudo service juju-db start.
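The three steps above can be sketched as a small script. The run and repair_juju_db names and the DRY_RUN switch are made up for illustration. The copy of the database directory is an extra precaution that was not part of the steps actually run here; it is included because a repair can discard data it cannot recover.

```shell
# run (hypothetical helper): print the command when DRY_RUN=1,
# otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# repair_juju_db (hypothetical helper): stop the service, repair the
# database, start the service again. Stops at the first failing step.
repair_juju_db() {
  dbpath="${1:-/var/lib/juju/db}"   # path used in this post
  run sudo service juju-db stop || return 1
  # Assumption: keep a copy first (not done in the original steps).
  run sudo cp -a "$dbpath" "${dbpath}.pre-repair" || return 1
  run sudo mongod --dbpath "$dbpath" --repair || return 1
  run sudo service juju-db start
}
```

Setting DRY_RUN=1 before calling repair_juju_db only prints the commands, which is a cheap way to review them before touching a controller.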
Did the Juju GUI magically start working? Unfortunately, not yet. But after a quick restart (sudo shutdown -r now works fine now), everything is back up and running.