My first question, After so many deployments ceph and rados-gw just don't seem to work

themightyelk · 19 May 2022 16:01

Hi everyone, my name is Paul and I have been pulling my hair out lately trying to understand this issue.

I am using a yaml deployment for juju to deploy OpenStack to my hardware stack. The initial setup for hardware is as follows

5X compute systems 3x Controllers 3 Ceph-units 2X 10gb sfp+ switches in stack mode

After deployment the board goes green but glance-simplestreams-sync does not work. The ceph cluster consistently shows slow or high ops or heatbeat up to 49300 seconds and higher. The volume creation doesn’t work. The manual image creation for glance is in error state as well. Ceph used to be the most stable in previous deployments of open-stack and i looked back at my yaml revisions and stable 316 was the last time it seemed to be problem free. I have been forum crawling for weeks and am not finding anything that has pointed me in the correct direction. I had 3 other units i was using for ceph which had an issue with bonding and switch communication. After changing them out for new units the ceph system deployed seemingly drama free however it is still in error. I know there has to be something i am missing but i am failing to figure out what. Please let me know what i can provide to help show the issue. i do see that OpenStack is showing part of the ceph communication is down but nothing seems to bring it up.

pmatulis · 19 May 2022 17:43

Hi Paul. Sorry to hear about your troubles. Can you provide a view of your deployment model via a generated bundle:

juju export-bundle

Also, a full status output of the model would be very useful:

juju status --relations

Kindly use a pastebin (e.g. https://pastebin.ubuntu.com).

themightyelk · 19 May 2022 23:17

Unfortunately I re-rolled it since this morning. I have the yaml that i used to deploy with so hopefully that will help. to this point ive been changing tons of different settings struggling to break or fix something enough to get a real answer. I was able to identify that there is some communication issue between cinder and ceph but nothing in the logs that pointed out what. Ubuntu Pastebin

pmatulis · 19 May 2022 23:55

You are using channels with legacy charms (those that do not support channels):

    charm: cs:cinder
    channel: stable

And the channel you’ve chose is actually a risk level. A channel consists of: <track-name>/<risk-level>.

It looks like you want to deploy OpenStack Xena on Ubuntu 20.04 LTS. So the above would make more sense like this:

    charm: ch:cinder
    channel: xena/stable

I am writing a tutorial whose objective is to deploy OpenStack Yoga on Ubuntu Focal using a bundle. It should be available soon. In the meantime, you can go through the OpenStack Charms Deployment Guide. It shows how to deploy OpenStack Yoga on Ubuntu Jammy on a charm-by-charm basis. It’s great for learning how Juju and OpenStack fit together.

themightyelk · 19 May 2022 23:59

Ok, Ill change the config and try it again. We changed from Xena to Wallaby just to test things and it worked somewhat but still had the same issues with ceph, cinder and glance. Cannot seem to get cinder or really anything to talk through to ceph. we are seeing lots of errors with DB communication. it took me a while to find out the issue with the change in bootstrap>jujuCluster in the mysqlrouter.conf file but i didnt realize the charm channel worked like that. Ill try a reroll with Xena and see if it works ok.

themightyelk · 20 May 2022 03:35

@ pmatulis The deployment went ok for the most part but the problem still persists. Volumes create eternally, images wont sync from glance-cimplestreams-sync nor can we create them with openstack image create cli commands. openstack volume service list shows backends up and enabled. the devices can ping each other. here is an updated yaml with the suggested change from cs to ch (I appreciate that, never would have figured that one out) we’ve been using similar revisions of the “working” juju deployment yaml for quite a while and never had this many problems. im attaching the juju bundle export and some logging info from the stack. I am just not sure why the issue is happening. i can see it is looking for something in sql however the juju push comes up green and i have looked through the configs for the individual portions of the stack and they seem ok.

YAML: Ubuntu Pastebin Logs/info: Ubuntu Pastebin

pmatulis · 20 May 2022 20:49

The xena track is not yet populated. Apologies for being unclear. Charm metadata can be queried with the juju info command. You will see that there is nothing available:

juju info --series focal keystone | grep xena/stable
  xena/stable:         --

The Yoga-Jammy tutorial is now published. This covers the use of channels and a bundle. The previously-mentioned Deploy Guide currently provides Yoga-Focal without using bundle (manual charm deploy). I recommend going through one of these guides to give you a working baseline.

themightyelk · 21 May 2022 00:46

ok, Apologies that I misunderstood. It makes a-lot more sense now and we attempted a redeploy with what we think is a workable yaml based on the config and the information in the guide. the challenge is still we get it deployed and cannot get any images to load manually or through glance-simplestreams-sync and volume creation fails or spins forever. we have been working on openstack for almost a year and have deployed successfully multiple iterations of openstack with Maas/juju and normally ceph and cinder are the consistently known good working programs. we purchased new hardware to deploy this system to which will be our new production system and nothing will work volume wise. I am stumped. this is now week 3 of struggling with the ceph, cinder-ceph, glance etc im attaching the yaml export of the deployment we just performed and its all green save ceph-dashboard but thats not important really. same image timeout error on manual creation so no way to deploy instances. no visible errors

themightyelk · 24 May 2022 00:11

Ok, I have used the example .yaml that is in the tutorial you mentioned. ceph-radosgw is stuck now and the openstack system is stalled in the same spot. here is the yaml I modified: https://docs.openstack.org/charm-guide/latest/_downloads/d61865dae585d1db0ee86cbec5b3cc8e/bundle-focal-yoga.yaml

the deployed yaml is here: Ubuntu Pastebin

the output of juju status is here: Ubuntu Pastebin

I am completely stumped as far as what direction to go in. ceph-radosgw seems to just stuck every time i run an yaml in any config with any changes… after tons of prodding i can usually get it to come up but then nothing works. Glance wont take an image and cinder wont make volumes but the openstack volume service list shows backends on line. this roll is a non-ha installation i only modified your example yml for my hardware and network. Additionally i have tried deploying the ceph system on multiple different machines with different hardware configs. all of them known good and known working from previous open-stack deployments.

finally gave the image create error again: ConflictException: 409: Client Error for url: https://10.1.7.38:9292/v2/images/96947094-2b6a-4317-904f-580f6d5f8450/file, Conflict

pmatulis · 24 May 2022 01:47

Hi Paul. I will go through the tutorial again and check Glance, Ceph RADOS Gateway, and Cinder. In the meantime, can you provide the relations information in your status output?

juju status --relations

Also, I noticed that your cloud was still deploying when you provided the status output.

themightyelk · 24 May 2022 03:34

pmatulis the “deploying” isn’t. this is one of the things i was talking about on all my deployments. the ceph-radosgw gets stuck… it just hangs at either relations or it hangs at updating or it hangs and says it cannot find the service. it just kinda stops at some point in the deployment. Sometimes I can get it to wake up with a reboot of the lxd sometimes I need to reboot the entire host machine. sometimes I have had to re-install the charm a couple of times before it comes up green but still nothing writes data to ceph. the gateways are green all the relations are good from all the documentation I have seen.

one host is down due to a thunderstorm causing a power event but the following is the output with the relations tag.

I feel like something has changed and I don’t know what. We’ve rolled openstack probably 55 times in development while testing different hardware and network configurations etc. Ceph used to be the only part that consistantly rolled out ok, only now has the ceph/cinder/glance portion just failed outright in some odd way. I have configs going back to almost a year ago that rolled fine. the irony is now were out of testing and deploying to do penetration tests but the stack fails.

I have gone bug hunting to see if there is something that is just causing communications issues but I havent been able to find anything concrete that would explain the behavior.

pmatulis · 24 May 2022 13:37

I see. What do the Juju unit agent logs say?

juju debug-log --replay --no-tail -i ceph-radosgw/0
juju debug-log --replay --no-tail -i ceph-mon/0

themightyelk · 24 May 2022 16:22

the debug log from radosgw: Ubuntu Pastebin

the debug log from ceph-mon: Ubuntu Pastebin

logged into the instance and ran journalctl -b | grep radosgw to find this : Ubuntu Pastebin

the “unit-ceph-mon-0: 04:39:44 INFO juju.worker.uniter.operation ran “update-status” hook (via explicit, bespoke hook script) unit-ceph-mon-0: 04:45:08 INFO unit.ceph-mon/0.juju-log Updating status.” keeps repeating but i truncated it in the log. for readability.

themightyelk · 24 May 2022 16:27

also from another forum on radosgw issues i ran the following and got this output.

juju ssh ceph-radosgw/0 sudo systemctl status ceph-radosgw@rgw.juju-e94665-0-lxd-1 ● ceph-radosgw@rgw.juju-e94665-0-lxd-1.service - Ceph rados gateway Loaded: loaded (/lib/systemd/system/ceph-radosgw@.service; disabled; vendor preset: enabled) Active: inactive (dead) Connection to 10.1.7.48 closed.

i dont know if that is overly helpful.

I looked into cinder as well:

unit-cinder-0: 16:17:09 WARNING unit.cinder/0.update-status ERROR no relation id specified unit-cinder-0: 16:17:10 INFO unit.cinder/0.juju-log Installing crontab: /etc/cron.d/cinder-volume-usage-audit unit-cinder-0: 16:17:11 INFO unit.cinder/0.juju-log get_network_addresses: [(‘10.1.7.55’, ‘10.1.7.55’)] unit-cinder-0: 16:17:11 INFO unit.cinder/0.juju-log Unit is ready unit-cinder-0: 16:17:12 INFO juju.worker.uniter.operation ran “update-status” hook (via explicit, bespoke hook script) unit-cinder-0: 16:21:37 INFO unit.cinder/0.juju-log Registered config file: /etc/cinder/cinder.conf unit-cinder-0: 16:21:37 INFO unit.cinder/0.juju-log Registered config file: /etc/cinder/api-paste.ini unit-cinder-0: 16:21:37 INFO unit.cinder/0.juju-log Registered config file: /etc/cinder/policy.json unit-cinder-0: 16:21:37 INFO unit.cinder/0.juju-log Registered config file: /etc/haproxy/haproxy.cfg unit-cinder-0: 16:21:37 INFO unit.cinder/0.juju-log Registered config file: /etc/apache2/sites-available/openstack_https_frontend.conf unit-cinder-0: 16:21:37 INFO unit.cinder/0.juju-log Registered config file: /etc/apache2/ports.conf unit-cinder-0: 16:21:37 INFO unit.cinder/0.juju-log Registered config file: /etc/memcached.conf unit-cinder-0: 16:21:37 INFO unit.cinder/0.juju-log Registered config file: /etc/apache2/sites-enabled/wsgi-openstack-api.conf unit-cinder-0: 16:21:37 INFO unit.cinder/0.juju-log Updating status. unit-cinder-0: 16:21:38 WARNING unit.cinder/0.update-status ERROR no relation id specified unit-cinder-0: 16:21:38 INFO unit.cinder/0.juju-log Installing crontab: /etc/cron.d/cinder-volume-usage-audit unit-cinder-0: 16:21:39 INFO unit.cinder/0.juju-log get_network_addresses: [(‘10.1.7.55’, ‘10.1.7.55’)] unit-cinder-0: 16:21:39 INFO unit.cinder/0.juju-log Unit is ready unit-cinder-0: 16:21:40 INFO juju.worker.uniter.operation ran “update-status” hook (via explicit, bespoke hook script)

and glance:

unit-glance-0: 16:16:33 INFO unit.glance/0.juju-log get_network_addresses: [(‘10.1.7.38’, ‘10.1.7.38’)] unit-glance-0: 16:16:34 INFO unit.glance/0.juju-log Unit is ready unit-glance-0: 16:16:34 INFO juju.worker.uniter.operation ran “update-status” hook (via explicit, bespoke hook script) unit-glance-0: 16:21:39 INFO unit.glance/0.juju-log Updating status. unit-glance-0: 16:21:39 INFO unit.glance/0.juju-log Making dir /var/lib/charm/glance root:root 555 unit-glance-0: 16:21:39 INFO unit.glance/0.juju-log Making dir /etc/ceph root:root 555 unit-glance-0: 16:21:39 INFO unit.glance/0.juju-log Registered config file: /etc/glance/glance-api.conf unit-glance-0: 16:21:39 INFO unit.glance/0.juju-log Registered config file: /etc/haproxy/haproxy.cfg unit-glance-0: 16:21:39 INFO unit.glance/0.juju-log Registered config file: /var/lib/charm/glance/ceph.conf unit-glance-0: 16:21:39 INFO unit.glance/0.juju-log Registered config file: /etc/apache2/sites-available/openstack_https_frontend.conf unit-glance-0: 16:21:39 INFO unit.glance/0.juju-log Registered config file: /etc/memcached.conf unit-glance-0: 16:21:39 INFO unit.glance/0.juju-log Registered config file: /etc/glance/glance-swift.conf unit-glance-0: 16:21:39 INFO unit.glance/0.juju-log Registered config file: /etc/glance/policy.yaml unit-glance-0: 16:21:39 WARNING unit.glance/0.update-status ERROR no relation id specified unit-glance-0: 16:21:41 INFO unit.glance/0.juju-log get_network_addresses: [(‘10.1.7.38’, ‘10.1.7.38’)] unit-glance-0: 16:21:41 INFO unit.glance/0.juju-log Unit is ready unit-glance-0: 16:21:42 INFO juju.worker.uniter.operation ran “update-status” hook (via explicit, bespoke hook script)

attempted to remove and re-install the ceph-radosgw and it didn’t work. i tried to clean it including relations and it wont remove or see the swift-ha relation:

mwsadminprod@spk-r1-maas-prod-1:~$ juju status --relations | grep ceph-radosgw ceph-radosgw unknown 0 ceph-radosgw stable 499 no ceph-radosgw:cluster ceph-radosgw:cluster swift-ha peer mwsadminprod@spk-r1-maas-prod-1:~$ juju remove-relation ceph-radosgw:cluster ceph-radosgw:cluster ERROR no relations found mwsadminprod@spk-r1-maas-prod-1:~$ juju status --relations | grep ceph-radosgw ceph-radosgw unknown 0 ceph-radosgw stable 499 no ceph-radosgw:cluster ceph-radosgw:cluster swift-ha peer mwsadminprod@spk-r1-maas-prod-1:~$ juju remove-relation ceph-radosgw:cluster ceph-radosgw:cluster

Full syslog from radosgw that is stuck after fresh deploy: Ubuntu Pastebin

pmatulis · 24 May 2022 22:36

I went through the tutorial again and it works as advertised. I created a Cinder test volume without issue and Glance allowed me to create a VM.

It could be that there is something amiss with your local environment. Perhaps machines/containers running out of disk space?

Since all your symptoms appear storage related, as a test, you may consider removing Ceph from the equation.

themightyelk · 24 May 2022 23:09

I know it isn’t a disk space issue. 4.3% utilization. We have been pulling on a thread but unsure if its useful. we noticed the reason radosgw is hanging has to do with haproxy using port 80 on both ipv4 and ipv6. this stops apache2 from starting. haproxy cannot see itself and then the radosgw just wont finish startup

May 24 22:42:32 juju-8ac48b-0-lxd-1 systemd[1]: ceph-radosgw@rgw.juju-8ac48b-0-lxd-1.service: Main process exited, code=exited, status=1/FAILURE May 24 22:42:32 juju-8ac48b-0-lxd-1 systemd[1]: ceph-radosgw@rgw.juju-8ac48b-0-lxd-1.service: Failed with result ‘exit-code’. May 24 22:42:32 juju-8ac48b-0-lxd-1 systemd[1]: ceph-radosgw@rgw.juju-8ac48b-0-lxd-1.service: Scheduled restart job, restart counter is at 2. May 24 22:42:32 juju-8ac48b-0-lxd-1 radosgw: 2022-05-24T22:42:32.987+0000 7fc572cede40 0 deferred set uid:gid to 64045:64045 (ceph:ceph) May 24 22:42:32 juju-8ac48b-0-lxd-1 radosgw: 2022-05-24T22:42:32.987+0000 7fc572cede40 0 ceph version 17.1.0 (c675060073a05d40ef404d5921c81178a52af6e0) quincy (dev), process radosgw, pid 129616 May 24 22:42:32 juju-8ac48b-0-lxd-1 radosgw: 2022-05-24T22:42:32.987+0000 7fc572cede40 0 framework: beast May 24 22:42:32 juju-8ac48b-0-lxd-1 radosgw: 2022-05-24T22:42:32.987+0000 7fc572cede40 0 framework conf key: port, val: 70 May 24 22:42:32 juju-8ac48b-0-lxd-1 radosgw: 2022-05-24T22:42:32.987+0000 7fc572cede40 1 radosgw_Main not setting numa affinity May 24 22:42:32 juju-8ac48b-0-lxd-1 radosgw: 2022-05-24T22:42:32.991+0000 7fc572cede40 1 rgw_d3n: rgw_d3n_l1_local_datacache_enabled=0 May 24 22:42:32 juju-8ac48b-0-lxd-1 radosgw: 2022-05-24T22:42:32.991+0000 7fc572cede40 1 D3N datacache enabled: 0 May 24 23:02:32 juju-8ac48b-0-lxd-1 radosgw: 2022-05-24T23:02:32.986+0000 7fc57181c700 -1 Initialization timeout, failed to initialize May 24 23:02:32 juju-8ac48b-0-lxd-1 radosgw[129616]: 2022-05-24T23:02:32.986+0000 7fc57181c700 -1 Initialization timeout, failed to initialize

billy-olsen · 2 June 2022 04:26

@chris.macnaughton any suggestions?

themightyelk · 2 June 2022 16:11

Ok, so it somewhat works now. I chased a bunch of keystone issued and issues with ceph and the ceph / radosgw units not getting info / connection to keystone. I gave up and rebuilt everything from scratch. Rebuilt maas and installed the package version not the snap. same with juju. Reconfigured all the switches and routing equipment and changed the units i used for ceph. I noticed there were some changes in documentation so we went through it all again and re-rolled it with the new configuration and hardware and it is working. I am going to make a 2nd stack and try to redeploy a cut down version of the yaml with the ceph units to see if any changes to code, maas or juju did the trick. Upside my openstack is up and running for now but with non prod hardware for Ceph however it fulfills the urgent need for testing. If I can narrow down any other information to assist anyone else who may have this issue I will but for now I still don’t know what actually caused the issues.

Tl;DR thank you for all the information and support. I still don’t know why it didn’t work but it works now ish?

possible issues could be a glitch in a firmware on the intel intel X540-AT2 network card that causes packet problems when in bond mode in ubuntu… but i also used solarflare and X520-DA2 and 3 different sets of 3 machines with various configs so i am leaning on it being the maas and juju snap installs…

chris.macnaughton · 2 June 2022 17:01

Hey @themightyelk,

Glad that it seems to be working now, but I’m very concerned about the apparent relation errors between ceph-radosgw and keystone. It looks like there were hook errors in a relation-changed hook, paired with an incomplete identity relation. If you replicate the issue, I’d be very interested in the output of juju show-unit ceph-radosgw/0 (or other failing unit) that’s been sanitised from any sensitive information

I’m also curious about your network reconfiguration, as reading through the beginning, I was suspecting a networking issue, and then later becomes a suspicion about a relation issue.

themightyelk · 2 June 2022 17:45

Ill see if I can find the docs and logs I saved. I grabbed a bunch of stuff including pulling info from the keystone databases to try and see if it wasn’t making tables or DB but unfortunately I don’t have a deep knowledge of what is supposed to be there. I may be able to replicate the error in a test environment but I have to move forward with testing now that the main stack is operational. Ill reply again with more log and info once I get it.

Original hardware config. Maas: ubuntu 20.04 with snap also juju snap Supermicro X10SLL-F E3-1271 V3 32Gb ram. dual 10gb sfp+ network connections in bond0 All systems are connected with dual 10gb nic in bond0

Switches: 2X Dell powerconnect 8024f configured in stack mode. ports 1-16 trunk mode with native vlan 6 (pxe vlan) 17-20 trunk only (stack cables) 21 trunk only (maas) 22 trunk and vlan 6 native for juju and 23 for router trunk only. each system had 1 connection per switch so CTL1 port 1 was plugged into switch 1. ctl 1 port 2 was in switch 2 port 1 etc.

juju controller is same hardware as maas 3x ctl machines : zt systems z1040hf configC dual e5-2680 V4 with 64gb ram and 2 ssd in raid 1 (created by maas) also dual 10gb sfp+ nic 82599ES 5x cmp machines: zt systems z1040hf configC dual e5-2680 V4 with 256gb ram and 2 ssd in raid 1 (created by maas) also dual 10gb sfp+ nic 82599ES 3X ceph machines config 1 (intended config): supermicro x9drd-it+ 32gb ram dual e5-2630V2 1X 500gb samsung ssd (boot) 3X samsung 870 1tb ssd (instance storage) 3X st2000lm015 2tb 2.5 hdd (additional file storage) 3X Ceph (test system 1) single cpu supermicro e5-2650 v2 (if i remember) with 32 gb ram 1X 500Gb ssd (boot) 2X 1tb ssd (instance storage) dual sfp+ nic solarflare & intel cards (i ran out of cards)

This config worked fine except errors in keystone and problems with all ceph/cinder/radosgw service

current working config: Maas: ubuntu 20.04 with package install juju as well. Supermicro X10SLL-F E3-1271 V3 32Gb ram. dual 10gb sfp+ network connections in bond0 All systems are connected with dual 10gb nic in bond0

Switches: 2X Dell powerconnect 8024f configured in individual config with 4 port channel trunk ports for inter-switch communication. 1-16 trunk mode with native vlan 6 (pxe vlan) 17-20 trunk only (port channel cables) 21 trunk only (maas) 22 trunk and vlan 6 native for juju and 23 for router trunk only. each system had 1 connection per switch so CTL1 port 1 was plugged into switch 1. ctl 1 port 2 was in switch 2 port 1 etc.

juju controller is same hardware as maas 3x ctl machines : zt systems z1040hf configC dual e5-2680 V4 with 64gb ram and 2 ssd in raid 1 (created by maas) also dual 10gb sfp+ nic 82599ES 5x cmp machines: zt systems z1040hf configC dual e5-2680 V4 with 256gb ram and 2 ssd in raid 1 (created by maas) also dual 10gb sfp+ nic 82599ES

3X Ceph (dev hardware) single cpu supermicro e5-2650 v2 with 32 gb ram 1X 500Gb ssd (boot) 7X 1tb hdd (instance storage) dual intel cards.