I’m provisioning a Canonical Charmed Kubernetes cluster on Azure using a custom shell script that bootstraps Juju on a VM and attaches worker VMs to a VMSS (Flexible orchestration mode). During juju bootstrap, MongoDB is installed and started via snap, but replica set initiation fails with the error "No host described in new configuration with {version: 1, term: 0} for replica set juju maps to this node". The relevant logs are below.
ap.go:462 starting mongo
DEBUG juju.agent agent.go:978 potential mongo addresses: [localhost:37017]
DEBUG juju.cmd.jujud bootstrap.go:491 calling EnsureMongoServerInstalled
INFO juju.mongo mongo.go:242 Ensuring mongo server is running; data directory /var/snap/juju-db/common; port 37017
INFO juju.packaging manager.go:81 installing "juju-db" via "snap"
INFO juju.packaging.manager run.go:88 Running: snap install --channel 4.4/stable juju-db
DEBUG juju.mongo mongo.go:427 using mongod: /snap/bin/juju-db.mongod --version:
db version v4.4.24
Build Info: {
"version": "4.4.24",
"gitVersion": "0b86b9b7b42ad9970c5f818c527dd86c0634243a",
"openSSLVersion": "OpenSSL 1.1.1f 31 Mar 2020",
"modules": [],
"allocator": "tcmalloc",
"environment": {
"distarch": "x86_64",
"target_arch": "x86_64"
}
}
INFO juju.service.snap snap.go:431 running snap command: [services juju-db]
INFO juju.service.snap snap.go:431 running snap command: [services juju-db]
DEBUG juju.core.network address.go:565 selected "135.119.177.105" as address, using scope "public"
DEBUG juju.worker.peergrouper initiate.go:37 Initiating mongo replicaset; dialInfo &mgo.DialInfo{Addrs:[]string{"localhost:37017"}, Direct:false, Timeout:300000000000, SyncTimeout:0, SocketTimeout:0, FailFast:false, Database:"", ReplicaSetName:"", Source:"", Service:"", ServiceHost:"", Mechanism:"", Username:"", Password:"", PoolLimit:0, DialServer:(func(*mgo.ServerAddr) (net.Conn, error))(0x19486a0), Dial:(func(net.Addr) (net.Conn, error))(nil)}; memberHostport "135.119.177.105:37017"; user ""; password ""
DEBUG juju.mongo open.go:160 mongodb connection failed, will retry: dial tcp 127.0.0.1:37017: connect: connection refused
DEBUG juju.mongo open.go:160 mongodb connection failed, will retry: dial tcp 127.0.0.1:37017: connect: connection refused
DEBUG juju.mongo open.go:174 dialed mongodb server at "127.0.0.1:37017"
INFO juju.replicaset replicaset.go:58 Initiating replicaset with config: {
Name: juju,
Version: 1,
Term: 0,
Protocol Version: 1,
Members: {
{1 "135.119.177.105:37017" juju-machine-id:0 voting},
},
}
DEBUG juju.mongo open.go:174 dialed mongodb server at "127.0.0.1:37017"
INFO juju.replicaset replicaset.go:60 Unsuccessful attempt to initiate replicaset: No host described in new configuration with {version: 1, term: 0} for replica set juju maps to this node
INFO juju.replicaset replicaset.go:58 Initiating replicaset with config: {
Name: juju,
Version: 1,
Term: 0,
Protocol Version: 1,
Members: {
{1 "135.119.177.105:37017" juju-machine-id:0 voting},
},
}
My provisioning script is:
az vmss scale --name jujuvmss --resource-group "$RESOURCE_GROUP_CANONICAL" --new-capacity "$WORKER_NODE_COUNT"
# Wait for VMSS instances to be provisioned
echo "Waiting for $WORKER_NODE_COUNT VMSS instances to be ready..."
for i in {1..40}; do
    count=$(az vmss list-instances -g "$RESOURCE_GROUP_CANONICAL" -n jujuvmss --query "length([])" -o tsv || echo 0)
    if [[ "$count" -ge "$WORKER_NODE_COUNT" ]]; then
        echo "$count VMSS instance(s) ready."
        break
    fi
    sleep 15
done
# Assign public IPs to all VMSS VMs and prepare SSH config
readarray -t VM_NAMES < <(az vm list -g "$RESOURCE_GROUP_CANONICAL" --query "[?contains(name, 'jujuvmss')].name" -o tsv)
ip_addrs=()
for i in "${!VM_NAMES[@]}"; do
    VM_NAME=${VM_NAMES[$i]}
    IP_NAME="juju-vmss-ip-$i"
    az network public-ip create -g "$RESOURCE_GROUP_CANONICAL" -n "$IP_NAME" --sku Standard --allocation-method Static --location "$LOCATION"
    NIC_ID=$(az network nic list -g "$RESOURCE_GROUP_CANONICAL" --query "[?virtualMachine.id!=null && contains(virtualMachine.id, '$VM_NAME')].id | [0]" -o tsv)
    NIC_NAME=$(basename "$NIC_ID")
    IP_CONFIG_NAME=$(az network nic show -g "$RESOURCE_GROUP_CANONICAL" -n "$NIC_NAME" --query "ipConfigurations[0].name" -o tsv)
    az network nic ip-config update -g "$RESOURCE_GROUP_CANONICAL" -n "$IP_CONFIG_NAME" --nic-name "$NIC_NAME" --public-ip-address "$IP_NAME"
    IP_ADDR=$(az network public-ip show -g "$RESOURCE_GROUP_CANONICAL" -n "$IP_NAME" --query "ipAddress" -o tsv)
    ip_addrs+=("$IP_ADDR")
done
# Configure SSH for the juju user
> ~/.ssh/config
chmod 600 ~/.ssh/config
for ip in "${ip_addrs[@]}"; do
    cat <<EOF >> ~/.ssh/config
Host $ip
    User juju
    IdentityFile $(pwd)/jujuKey
    StrictHostKeyChecking no
EOF
done
eval "$(ssh-agent -s)"
ssh-add jujuKey
# Set hostname and /etc/hosts on each VM to ensure MongoDB replica set works
for i in "${!ip_addrs[@]}"; do
    ip="${ip_addrs[$i]}"
    echo "Setting hostname and /etc/hosts on $ip..."
    ssh juju@"$ip" <<EOF
sudo hostnamectl set-hostname "$ip"
echo "$ip $ip" | sudo tee -a /etc/hosts
EOF
done
# Configure Juju manual cloud definition for VMSS
cat <<EOF > /tmp/manual-cloud.yaml
clouds:
  azure-vmss-manual:
    type: manual
EOF
juju add-cloud --client -f /tmp/manual-cloud.yaml || echo "Cloud may already exist"
# Bootstrap Juju on first VMSS instance (controller) using IP
bootstrap_ip="${ip_addrs[0]}"
juju bootstrap manual/"$bootstrap_ip" "$CONTROLLER_NAME" --bootstrap-series=jammy --constraints "instance-type=$VM_SIZE" --debug
Question/Request: How can I resolve the MongoDB replicaset bootstrap failure? Is this due to the hostname/IP mapping or VMSS IP allocation timing? Is there a recommended approach to ensure the Mongo replicaset recognizes the bootstrap node correctly?
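To test the hostname/IP mapping hypothesis, one check that could be run on the bootstrap VM is whether the address Juju selected as memberHostport (135.119.177.105 in the log above) is actually assigned to a local interface. On Azure, a public IP attached to a NIC is normally NAT-mapped by the platform and is not configured inside the guest, which would be consistent with the "maps to this node" error. The following is a minimal sketch only: the helper name addr_in_list is hypothetical, and hostname -I is assumed to be available on the VM.

```shell
# Sketch only: addr_in_list is a hypothetical helper, not part of Juju or the az CLI.

# Return 0 if address $1 appears in the whitespace-separated list $2.
addr_in_list() {
    for _addr in $2; do
        if [ "$_addr" = "$1" ]; then
            return 0
        fi
    done
    return 1
}

# Address Juju selected as memberHostport in the log (without the port).
member_ip="135.119.177.105"

# hostname -I lists the addresses configured on local interfaces
# (assumed available; falls back to an empty list otherwise).
local_addrs="$(hostname -I 2>/dev/null || true)"

if addr_in_list "$member_ip" "$local_addrs"; then
    echo "$member_ip is assigned to a local interface"
else
    echo "$member_ip is NOT local; local addresses: ${local_addrs:-none}"
fi
```

If the address turns out not to be local, bootstrapping against an address the VM actually holds (for example its private IP) would be the next thing to try. That is an assumption to verify, not something the logs confirm.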