ceph-mon goes CrashLoopBackOff in Rocky Linux 9.2 (Blue Onyx) #12687

gowthamsadasivam · 2023-08-08T19:56:01Z

I am unable to get a rook-ceph cluster working in my new Kubernetes cluster. The ceph-mon pods keep crashing with a CrashLoopBackOff error. The exact same setup works fine on Rocky Linux 8.8.

I have a Kubernetes cluster up and running with the following configuration:

Host OS: Rocky Linux 9.2
Linux Kernel: 5.14.0-284.25.1.el9_2.x86_64
Number of Nodes: 5
Kubernetes version: 1.24.15
Docker version: 20.10.24
Kubernetes Distro: RKE

I followed the official documentation to install and deploy rook-ceph.

Installed the rook-ceph operator:
# helm install --create-namespace --namespace rook-ceph rook-ceph rook-release/rook-ceph

Checked that the operator pod is running:
# k -n rook-ceph get pods -l "app=rook-ceph-operator"
Output:

NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-54f6c4c4fc-g5r6x   1/1     Running   0          29m

Created a Ceph cluster using the official example YAML file (https://github.com/rook/rook/blob/master/deploy/examples/cluster.yaml):
# k apply -f ./cluster.yaml

Checked the status of the cluster creation:
# k get cephclusters.ceph.rook.io -n rook-ceph
Output:

NAME        DATADIRHOSTPATH   MONCOUNT   AGE   PHASE   MESSAGE                            HEALTH       EXTERNAL   FSID
rook-ceph   /var/lib/rook     3          29m   Ready   Failed to configure ceph cluster   HEALTH_ERR

Checking the pods in the rook-ceph namespace, I found that the monitor daemons are in CrashLoopBackOff:

# kubectl get po -n rook-ceph | grep mon

rook-ceph-mon-a-8666c64c84-dvnkg                                  1/2     Running            1 (12s ago)      118s
rook-ceph-mon-b-6bdd8b646f-c6frj                                  1/2     CrashLoopBackOff   49 (2m36s ago)   29m
rook-ceph-mon-c-84c874775f-hb5rj                                  2/2     Running            0                29m

After a few minutes:

# k get po -n rook-ceph | grep mon

rook-ceph-mon-a-8666c64c84-7bz8w                                  1/2     CrashLoopBackOff   7 (18s ago)      16m
rook-ceph-mon-b-6bdd8b646f-s5x4f                                  1/2     CrashLoopBackOff   11 (3m48s ago)   36m
rook-ceph-mon-c-84c874775f-hb5rj                                  2/2     Running            0                36m

I also noticed that a couple of the OSD pods were restarting:

# k get po -n rook-ceph | grep osd

rook-ceph-osd-0-56c8b4644d-t4lhq                                  2/2     Running            0                37m
rook-ceph-osd-1-86659d7c7f-pkrrn                                  2/2     Running            0                37m
rook-ceph-osd-2-776dcd969f-5shmh                                  2/2     Running            0                37m
rook-ceph-osd-3-7f98f79dc-wjwb9                                   1/2     Running            1 (27m ago)     37m
rook-ceph-osd-4-5b997dcc4-5gzsm                                   1/2     Running            2 (25m ago)      37m
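To keep track of which pods are unhealthy between checks, listings like the ones above can be filtered mechanically. The following is a minimal sketch, not part of Rook or kubectl: it assumes the default `kubectl get pods` column order (NAME, READY, STATUS, RESTARTS, AGE), and the helper names `parse_pod_rows` and `unhealthy` are made up for illustration.

```python
from typing import List, NamedTuple


class PodRow(NamedTuple):
    name: str
    ready: int     # containers ready
    total: int     # containers expected
    status: str
    restarts: int


def parse_pod_rows(text: str) -> List[PodRow]:
    """Parse rows of `kubectl get pods` output (assumed default columns)."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the NAME/READY/STATUS header row.
        if len(fields) < 4 or fields[0] == "NAME":
            continue
        ready, total = fields[1].split("/")
        # RESTARTS may look like "7 (18s ago)"; the count is the first token.
        rows.append(PodRow(fields[0], int(ready), int(total), fields[2], int(fields[3])))
    return rows


def unhealthy(rows: List[PodRow]) -> List[PodRow]:
    """Pods that are not fully ready or have restarted at least once."""
    return [r for r in rows if r.ready < r.total or r.restarts > 0]


# Rows copied from the listings in this post.
SAMPLE = """\
rook-ceph-mon-a-8666c64c84-7bz8w   1/2   CrashLoopBackOff   7 (18s ago)      16m
rook-ceph-mon-b-6bdd8b646f-s5x4f   1/2   CrashLoopBackOff   11 (3m48s ago)   36m
rook-ceph-mon-c-84c874775f-hb5rj   2/2   Running            0                36m
rook-ceph-osd-4-5b997dcc4-5gzsm    1/2   Running            2 (25m ago)      37m
"""

if __name__ == "__main__":
    for row in unhealthy(parse_pod_rows(SAMPLE)):
        print(f"{row.name}: {row.ready}/{row.total} ready, {row.status}, {row.restarts} restarts")
```

Instead of the embedded sample, the text could come from `kubectl -n rook-ceph get pods`.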

Last few lines of the logs from one of the crashing ceph-mon pods:

Uptime(secs): 0.0 total, 0.0 interval
Flush(GB): cumulative 0.000, interval 0.000
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 0.00 GB write, 0.15 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count

** File Read Latency Histogram By Level [default] **

debug 2023-08-08T16:51:09.238+0000 7f3788744b80  0 starting mon.b rank 1 at public addrs [v2:10.43.89.237:3300/0,v1:10.43.89.237:6789/0] at bind addrs [v2:10.42.240.3:3300/0,v1:10.42.240.3:6789/0] mon_data /var/lib/ceph/mon/ceph-b fsid 4119beaf-6a23-4fa2-afa2-62f0974c0409
debug 2023-08-08T16:51:09.239+0000 7f3788744b80  1 mon.b@-1(???) e3 preinit fsid 4119beaf-6a23-4fa2-afa2-62f0974c0409
debug 2023-08-08T16:51:09.239+0000 7f3788744b80  0 mon.b@-1(???).mds e1 new map
debug 2023-08-08T16:51:09.239+0000 7f3788744b80  0 mon.b@-1(???).mds e1 print_map
e1
enable_multiple, ever_enabled_multiple: 1,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: -1
 
No filesystems configured

debug 2023-08-08T16:51:09.239+0000 7f3788744b80  0 mon.b@-1(???).osd e15 crush map has features 3314932999778484224, adjusting msgr requires
debug 2023-08-08T16:51:09.239+0000 7f3788744b80  0 mon.b@-1(???).osd e15 crush map has features 288514050185494528, adjusting msgr requires
debug 2023-08-08T16:51:09.239+0000 7f3788744b80  0 mon.b@-1(???).osd e15 crush map has features 288514050185494528, adjusting msgr requires
debug 2023-08-08T16:51:09.239+0000 7f3788744b80  0 mon.b@-1(???).osd e15 crush map has features 288514050185494528, adjusting msgr requires
debug 2023-08-08T16:51:09.239+0000 7f3788744b80  1 mon.b@-1(???).paxosservice(auth 1..67) refresh upgraded, format 0 -> 3
debug 2023-08-08T16:51:09.241+0000 7f3788744b80  0 mon.b@-1(probing) e3  my rank is now 1 (was -1)
debug 2023-08-08T16:51:09.852+0000 7f377b99e700  0 log_channel(cluster) log [INF] : mon.b calling monitor election
debug 2023-08-08T16:51:09.852+0000 7f377b99e700  1 paxos.1).electionLogic(388) init, last seen epoch 388
debug 2023-08-08T16:51:09.858+0000 7f377b99e700  1 mon.b@1(electing) e3 collect_metadata :  no unique device id for : fallback method has no model nor serial
debug 2023-08-08T16:51:14.858+0000 7f377e1a3700  1 paxos.1).electionLogic(389) init, last seen epoch 389, mid-election, bumping
debug 2023-08-08T16:51:14.888+0000 7f377e1a3700  1 mon.b@1(electing) e3 collect_metadata :  no unique device id for : fallback method has no model nor serial
debug 2023-08-08T16:51:19.901+0000 7f377e1a3700  0 log_channel(cluster) log [INF] : mon.b is new leader, mons b,c in quorum (ranks 1,2)
debug 2023-08-08T16:51:20.095+0000 7f377b99e700  0 log_channel(cluster) log [DBG] : monmap e3: 3 mons at {a=[v2:10.43.134.59:3300/0,v1:10.43.134.59:6789/0],b=[v2:10.43.89.237:3300/0,v1:10.43.89.237:6789/0],c=[v2:10.43.159.48:3300/0,v1:10.43.159.48:6789/0]} removed_ranks: {}
debug 2023-08-08T16:51:20.108+0000 7f377b99e700  1 mon.b@1(leader) e3 collect_metadata :  no unique device id for : fallback method has no model nor serial
debug 2023-08-08T16:51:20.108+0000 7f377b99e700  0 log_channel(cluster) log [DBG] : fsmap 
debug 2023-08-08T16:51:20.108+0000 7f377b99e700  0 log_channel(cluster) log [DBG] : osdmap e15: 5 total, 3 up, 5 in
debug 2023-08-08T16:51:20.121+0000 7f377b99e700  0 log_channel(cluster) log [DBG] : mgrmap e13: b(active, starting, since 3h)
debug 2023-08-08T16:51:20.201+0000 7f377b99e700  0 mon.b@1(leader) e3 handle_command mon_command({"prefix": "osd crush set-device-class", "class": "ssd", "ids": ["3"]} v 0) v1
debug 2023-08-08T16:51:20.201+0000 7f377b99e700  0 log_channel(audit) log [INF] : from='osd.3 ' entity='osd.3' cmd=[{"prefix": "osd crush set-device-class", "class": "ssd", "ids": ["3"]}]: dispatch
debug 2023-08-08T16:51:20.201+0000 7f377b99e700  0 mon.b@1(leader) e3 handle_command mon_command({"prefix": "osd crush set-device-class", "class": "ssd", "ids": ["4"]} v 0) v1
debug 2023-08-08T16:51:20.201+0000 7f377b99e700  0 log_channel(audit) log [INF] : from='osd.4 ' entity='osd.4' cmd=[{"prefix": "osd crush set-device-class", "class": "ssd", "ids": ["4"]}]: dispatch
debug 2023-08-08T16:51:20.201+0000 7f377b99e700  0 mon.b@1(leader) e3 handle_command mon_command({"prefix": "osd pool create", "format": "json", "pool": ".mgr", "pg_num": 1, "pg_num_min": 1, "pg_num_max": 32} v 0) v1
debug 2023-08-08T16:51:20.201+0000 7f377b99e700  0 log_channel(audit) log [INF] : from='mgr.24227 ' entity='mgr.b' cmd=[{"prefix": "osd pool create", "format": "json", "pool": ".mgr", "pg_num": 1, "pg_num_min": 1, "pg_num_max": 32}]: dispatch
debug 2023-08-08T16:52:18.926+0000 7f37811a9700 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
debug 2023-08-08T16:52:18.926+0000 7f37811a9700 -1 mon.b@1(leader) e3 *** Got Signal Terminated ***
debug 2023-08-08T16:52:18.926+0000 7f37811a9700  1 mon.b@1(leader) e3 shutdown
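The log ends with the mon receiving a Terminated signal rather than reporting an internal failure, so it may help to measure how long the daemon ran before the signal arrived. This is a rough sketch, assuming the log lines keep the `debug <timestamp>` prefix shown above; `seconds_until_sigterm` is a hypothetical helper name, not a Ceph tool.

```python
import re
from datetime import datetime
from typing import Optional

# Matches the "debug 2023-08-08T16:51:09.238+0000 ..." prefix of mon log lines.
TS_PREFIX = re.compile(r"^debug (\S+) ")


def parse_ts(line: str) -> Optional[datetime]:
    match = TS_PREFIX.match(line)
    if not match:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f%z")


def seconds_until_sigterm(log_text: str) -> Optional[float]:
    """Seconds between the first 'starting mon.' line and the last SIGTERM line."""
    start = end = None
    for line in log_text.splitlines():
        ts = parse_ts(line)
        if ts is None:
            continue
        if start is None and " starting mon." in line:
            start = ts
        if "Got Signal Terminated" in line:
            end = ts
    if start is None or end is None:
        return None
    return (end - start).total_seconds()


# Two lines from the mon.b log above (the first one abbreviated).
SAMPLE_LOG = """\
debug 2023-08-08T16:51:09.238+0000 7f3788744b80  0 starting mon.b rank 1 at public addrs [v2:10.43.89.237:3300/0,v1:10.43.89.237:6789/0]
debug 2023-08-08T16:52:18.926+0000 7f37811a9700 -1 mon.b@1(leader) e3 *** Got Signal Terminated ***
"""

if __name__ == "__main__":
    print(seconds_until_sigterm(SAMPLE_LOG))
```

For the excerpt above this gives roughly 69.7 seconds between startup and the signal. If the interval is similar across restarts, that would be consistent with something periodic outside the daemon stopping it, although it does not prove it.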

I am not sure whether the issue is caused by this line:

debug 2023-08-08T16:51:20.108+0000 7f377b99e700  1 mon.b@1(leader) e3 collect_metadata :  no unique device id for : fallback method has no model nor serial

Last few lines of the rook-ceph operator logs:

2023-08-08 17:22:39.134787 I | exec: exec timeout waiting for process ceph to return. Sending interrupt signal to the process
2023-08-08 17:22:39.145788 E | op-config: failed to run command ceph [config assimilate-conf -i /tmp/3988246856 -o /tmp/3988246856.out]
2023-08-08 17:22:39.145801 E | op-config: failed to apply ceph settings:
2023-08-08 17:22:39.174929 E | ceph-cluster-controller: failed to reconcile CephCluster "rook-ceph/rook-ceph". failed to reconcile cluster "rook-ceph": failed to configure local ceph cluster: failed to create cluster: failed to start ceph monitors: failed to set Rook and/or user-defined Ceph config options after forcefully updating the existing mons: failed to apply default Ceph configurations: failed to set all keys: failed to set ceph config in the centralized mon configuration database; output: Interrupted
Traceback (most recent call last):
  File "/usr/bin/ceph", line 1326, in <module>
    retval = main()
  File "/usr/bin/ceph", line 1272, in main
    outf.write(outbuf)
TypeError: a bytes-like object is required, not 'str': exit status 1
2023-08-08 17:22:54.257902 W | op-mon: failed to check mon health. failed to get mon quorum status: mon quorum status failed: exit status 1
2023-08-08 17:23:37.542835 E | ceph-cluster-controller: failed to get ceph status. failed to get status. . timed out: exit status 1
2023-08-08 17:23:52.648687 E | ceph-cluster-controller: failed to get ceph daemons versions. failed to run 'ceph versions'. . timed out: exit status 1
2023-08-08 17:23:54.341753 W | op-mon: failed to check mon health. failed to get mon quorum status: mon quorum status failed: exit status 1
2023-08-08 17:24:40.567237 I | op-k8sutil: ROOK_WATCH_FOR_NODE_FAILURE="true" (default)
2023-08-08 17:24:54.442658 W | op-mon: failed to check mon health. failed to get mon quorum status: mon quorum status failed: exit status 1
2023-08-08 17:25:07.293029 I | op-k8sutil: ROOK_WATCH_FOR_NODE_FAILURE="true" (default)
2023-08-08 17:25:07.742814 E | ceph-cluster-controller: failed to get ceph status. failed to get status. . timed out: exit status 1
2023-08-08 17:25:10.772635 I | op-k8sutil: ROOK_WATCH_FOR_NODE_FAILURE="true" (default)
2023-08-08 17:25:22.846933 E | ceph-cluster-controller: failed to get ceph daemons versions. failed to run 'ceph versions'. . timed out: exit status 1
2023-08-08 17:25:39.973289 I | op-mon: mon "a" is back in quorum, removed from mon out timeout list
2023-08-08 17:25:39.973302 I | op-mon: marking mon "b" out of quorum
2023-08-08 17:25:39.973411 W | cephclient: skipping adding mon "b" to config file, detected out of quorum
2023-08-08 17:25:39.976373 I | cephclient: writing config file /var/lib/rook/rook-ceph/rook-ceph.config
2023-08-08 17:25:39.976436 I | cephclient: generated admin config in /var/lib/rook/rook-ceph
2023-08-08 17:25:39.990982 W | op-mon: mon "b" not found in quorum, waiting for timeout (599 seconds left) before failover
2023-08-08 17:26:10.098307 I | op-k8sutil: ROOK_WATCH_FOR_NODE_FAILURE="true" (default)
2023-08-08 17:26:37.944116 E | ceph-cluster-controller: failed to get ceph status. failed to get status. . timed out: exit status 1
2023-08-08 17:26:38.689199 I | op-k8sutil: ROOK_WATCH_FOR_NODE_FAILURE="true" (default)
2023-08-08 17:26:40.067401 W | op-mon: failed to check mon health. failed to get mon quorum status: mon quorum status failed: exit status 1
2023-08-08 17:27:25.569460 W | op-mon: mon "b" not found in quorum, waiting for timeout (494 seconds left) before failover
2023-08-08 17:27:56.759934 E | ceph-cluster-controller: failed to get ceph status. failed to get status. . timed out: exit status 1
2023-08-08 17:28:11.847905 E | ceph-cluster-controller: failed to get ceph daemons versions. failed to run 'ceph versions'. . timed out: exit status 1
2023-08-08 17:28:17.051328 I | op-mon: marking mon "a" out of quorum
2023-08-08 17:28:17.051462 W | cephclient: skipping adding mon "a" to config file, detected out of quorum
2023-08-08 17:28:17.051467 W | cephclient: skipping adding mon "b" to config file, detected out of quorum
2023-08-08 17:28:17.058243 I | cephclient: writing config file /var/lib/rook/rook-ceph/rook-ceph.config
2023-08-08 17:28:17.058301 I | cephclient: generated admin config in /var/lib/rook/rook-ceph
2023-08-08 17:28:17.071754 W | op-mon: mon "a" not found in quorum, waiting for timeout (599 seconds left) before failover
2023-08-08 17:28:17.071760 I | op-mon: marking mon "b" back in quorum
2023-08-08 17:28:17.071813 W | cephclient: skipping adding mon "a" to config file, detected out of quorum
2023-08-08 17:28:17.073196 I | cephclient: writing config file /var/lib/rook/rook-ceph/rook-ceph.config
2023-08-08 17:28:17.073239 I | cephclient: generated admin config in /var/lib/rook/rook-ceph
2023-08-08 17:28:17.082864 I | op-mon: mon "b" is back in quorum, removed from mon out timeout list
2023-08-08 17:29:17.162542 W | op-mon: failed to check mon health. failed to get mon quorum status: mon quorum status failed: exit status 1
2023-08-08 17:29:26.941525 E | ceph-cluster-controller: failed to get ceph status. failed to get status. . timed out: exit status 1
2023-08-08 17:29:42.044174 E | ceph-cluster-controller: failed to get ceph daemons versions. failed to run 'ceph versions'. . timed out: exit status 1
2023-08-08 17:29:47.989904 I | op-k8sutil: ROOK_WATCH_FOR_NODE_FAILURE="true" (default)
2023-08-08 17:30:14.112555 I | op-k8sutil: ROOK_WATCH_FOR_NODE_FAILURE="true" (default)
2023-08-08 17:30:16.526420 I | op-k8sutil: ROOK_WATCH_FOR_NODE_FAILURE="true" (default)
2023-08-08 17:30:17.242029 W | op-mon: failed to check mon health. failed to get mon quorum status: mon quorum status failed: exit status 1
2023-08-08 17:30:57.143773 E | ceph-cluster-controller: failed to get ceph status. failed to get status. . timed out: exit status 1
2023-08-08 17:31:12.247097 E | ceph-cluster-controller: failed to get ceph daemons versions. failed to run 'ceph versions'. . timed out: exit status 1
2023-08-08 17:31:16.979717 I | op-k8sutil: ROOK_WATCH_FOR_NODE_FAILURE="true" (default)
2023-08-08 17:31:17.341680 W | op-mon: failed to check mon health. failed to get mon quorum status: mon quorum status failed: exit status 1
2023-08-08 17:31:44.614230 I | op-k8sutil: ROOK_WATCH_FOR_NODE_FAILURE="true" (default)
2023-08-08 17:32:17.443286 W | op-mon: failed to check mon health. failed to get mon quorum status: mon quorum status failed: exit status 1
2023-08-08 17:32:27.342467 E | ceph-cluster-controller: failed to get ceph status. failed to get status. . timed out: exit status 1
2023-08-08 17:32:42.444800 E | ceph-cluster-controller: failed to get ceph daemons versions. failed to run 'ceph versions'. . timed out: exit status 1
2023-08-08 17:33:17.542117 W | op-mon: failed to check mon health. failed to get mon quorum status: mon quorum status failed: exit status 1
2023-08-08 17:33:57.543135 E | ceph-cluster-controller: failed to get ceph status. failed to get status. . timed out: exit status 1
2023-08-08 17:34:12.646549 E | ceph-cluster-controller: failed to get ceph daemons versions. failed to run 'ceph versions'. . timed out: exit status 1
2023-08-08 17:34:17.643146 W | op-mon: failed to check mon health. failed to get mon quorum status: mon quorum status failed: exit status 1
2023-08-08 17:34:54.572527 I | op-k8sutil: ROOK_WATCH_FOR_NODE_FAILURE="true" (default)
2023-08-08 17:35:03.173330 W | op-mon: mon "a" not found in quorum, waiting for timeout (193 seconds left) before failover
2023-08-08 17:35:20.577572 I | op-k8sutil: ROOK_WATCH_FOR_NODE_FAILURE="true" (default)
2023-08-08 17:35:23.271193 I | op-k8sutil: ROOK_WATCH_FOR_NODE_FAILURE="true" (default)
2023-08-08 17:35:27.744472 E | ceph-cluster-controller: failed to get ceph status. failed to get status. . timed out: exit status 1
2023-08-08 17:35:42.845016 E | ceph-cluster-controller: failed to get ceph daemons versions. failed to run 'ceph versions'. . timed out: exit status 1
2023-08-08 17:36:03.254556 W | op-mon: failed to check mon health. failed to get mon quorum status: mon quorum status failed: exit status 1
2023-08-08 17:36:24.039127 I | op-k8sutil: ROOK_WATCH_FOR_NODE_FAILURE="true" (default)
2023-08-08 17:36:48.766270 W | op-mon: mon "a" not found in quorum, waiting for timeout (88 seconds left) before failover
2023-08-08 17:36:51.905578 I | op-k8sutil: ROOK_WATCH_FOR_NODE_FAILURE="true" (default)
2023-08-08 17:36:54.693888 I | ceph-cluster-controller: reconciling ceph cluster in namespace "rook-ceph"
2023-08-08 17:36:54.698697 I | ceph-spec: parsing mon endpoints: a=10.43.134.59:6789,b=10.43.89.237:6789,c=10.43.159.48:6789
2023-08-08 17:36:54.730702 I | ceph-spec: detecting the ceph image version for image quay.io/ceph/ceph:v17.2.6...
2023-08-08 17:36:56.136278 I | ceph-spec: detected ceph image version: "17.2.6-0 quincy"
2023-08-08 17:36:56.136289 I | ceph-cluster-controller: validating ceph version from provided image
2023-08-08 17:36:56.139149 I | ceph-spec: parsing mon endpoints: a=10.43.134.59:6789,b=10.43.89.237:6789,c=10.43.159.48:6789
2023-08-08 17:36:56.146479 I | cephclient: writing config file /var/lib/rook/rook-ceph/rook-ceph.config
2023-08-08 17:36:56.146556 I | cephclient: generated admin config in /var/lib/rook/rook-ceph
2023-08-08 17:37:11.245889 E | ceph-cluster-controller: failed to get ceph daemons versions, this typically happens during the first cluster initialization. failed to run 'ceph versions'. . timed out: exit status 1
2023-08-08 17:37:11.245904 I | ceph-cluster-controller: cluster "rook-ceph": version "17.2.6-0 quincy" detected for image "quay.io/ceph/ceph:v17.2.6"
2023-08-08 17:37:11.294641 I | op-mon: start running mons
2023-08-08 17:37:11.297498 I | ceph-spec: parsing mon endpoints: a=10.43.134.59:6789,b=10.43.89.237:6789,c=10.43.159.48:6789
2023-08-08 17:37:11.311021 I | op-mon: saved mon endpoints to config map map[csi-cluster-config-json:[{"clusterID":"rook-ceph","monitors":["10.43.134.59:6789","10.43.89.237:6789","10.43.159.48:6789"],"namespace":""}] data:a=10.43.134.59:6789,b=10.43.89.237:6789,c=10.43.159.48:6789 mapping:{"node":{"a":{"Name":"fleet05.azzzzzzza.dvl","Hostname":"fleet05.azzzzzzza.dvl","Address":"192.168.100.6"},"b":{"Name":"fleet03.azzzzzzza.dvl","Hostname":"fleet03.azzzzzzza.dvl","Address":"192.168.100.4"},"c":{"Name":"fleet02.azzzzzzza.dvl","Hostname":"fleet02.azzzzzzza.dvl","Address":"192.168.100.3"}}} maxMonId:2 outOfQuorum:]
2023-08-08 17:37:11.485118 I | cephclient: writing config file /var/lib/rook/rook-ceph/rook-ceph.config
2023-08-08 17:37:11.485212 I | cephclient: generated admin config in /var/lib/rook/rook-ceph
2023-08-08 17:37:13.086503 I | op-mon: targeting the mon count 3
2023-08-08 17:37:13.097500 I | op-config: applying ceph settings:
[global]
mon allow pool delete   = true
mon cluster log file    = 
mon allow pool size one = true
2023-08-08 17:37:28.097938 I | exec: exec timeout waiting for process ceph to return. Sending interrupt signal to the process
2023-08-08 17:37:28.099303 E | op-config: failed to open assimilate output file /tmp/2943816426.out. open /tmp/2943816426.out: no such file or directory
2023-08-08 17:37:28.099316 E | op-config: failed to run command ceph [config assimilate-conf -i /tmp/2943816426 -o /tmp/2943816426.out]
2023-08-08 17:37:28.099322 E | op-config: failed to apply ceph settings:
2023-08-08 17:37:28.099370 W | op-mon: failed to set Rook and/or user-defined Ceph config options before starting mons; will retry after starting mons. failed to apply default Ceph configurations: failed to set all keys: failed to set ceph config in the centralized mon configuration database; output: Cluster connection aborted: exit status 1
2023-08-08 17:37:28.099373 I | op-mon: checking for basic quorum with existing mons
2023-08-08 17:37:28.148092 I | op-mon: mon "b" cluster IP is 10.43.89.237
2023-08-08 17:37:28.176492 I | op-mon: mon "c" cluster IP is 10.43.159.48
2023-08-08 17:37:28.503174 I | op-mon: mon "a" cluster IP is 10.43.134.59
2023-08-08 17:37:29.114714 I | op-mon: saved mon endpoints to config map map[csi-cluster-config-json:[{"clusterID":"rook-ceph","monitors":["10.43.159.48:6789","10.43.134.59:6789","10.43.89.237:6789"],"namespace":""}] data:a=10.43.134.59:6789,b=10.43.89.237:6789,c=10.43.159.48:6789 mapping:{"node":{"a":{"Name":"fleet05.azzzzzzza.dvl","Hostname":"fleet05.azzzzzzza.dvl","Address":"192.168.100.6"},"b":{"Name":"fleet03.azzzzzzza.dvl","Hostname":"fleet03.azzzzzzza.dvl","Address":"192.168.100.4"},"c":{"Name":"fleet02.azzzzzzza.dvl","Hostname":"fleet02.azzzzzzza.dvl","Address":"192.168.100.3"}}} maxMonId:2 outOfQuorum:]
2023-08-08 17:37:29.702651 I | cephclient: writing config file /var/lib/rook/rook-ceph/rook-ceph.config
2023-08-08 17:37:29.702775 I | cephclient: generated admin config in /var/lib/rook/rook-ceph
2023-08-08 17:37:30.123623 I | op-mon: deployment for mon rook-ceph-mon-b already exists. updating if needed
2023-08-08 17:37:30.128833 I | op-k8sutil: deployment "rook-ceph-mon-b" did not change, nothing to update
2023-08-08 17:37:30.128842 I | op-mon: waiting for mon quorum with [b c a]
2023-08-08 17:37:30.711571 I | op-mon: mons running: [b c a]
2023-08-08 17:37:50.809871 I | op-mon: mons running: [b c a]
2023-08-08 17:37:59.060626 E | ceph-cluster-controller: failed to get ceph status. failed to get status. . timed out: exit status 1
2023-08-08 17:38:10.902726 I | op-mon: mons running: [b c a]
2023-08-08 17:38:14.146174 E | ceph-cluster-controller: failed to get ceph daemons versions. failed to run 'ceph versions'. . timed out: exit status 1
2023-08-08 17:38:30.992733 I | op-mon: mons running: [b c a]
2023-08-08 17:38:31.539021 I | op-mon: Monitors in quorum: [a c]
2023-08-08 17:38:31.543878 I | op-mon: deployment for mon rook-ceph-mon-c already exists. updating if needed
2023-08-08 17:38:31.548763 I | op-k8sutil: deployment "rook-ceph-mon-c" did not change, nothing to update
2023-08-08 17:38:31.548771 I | op-mon: waiting for mon quorum with [b c a]
2023-08-08 17:38:31.571366 I | op-mon: mons running: [b c a]
2023-08-08 17:38:32.141714 I | op-mon: Monitors in quorum: [a c]
2023-08-08 17:38:32.146556 I | op-mon: deployment for mon rook-ceph-mon-a already exists. updating if needed
2023-08-08 17:38:32.151075 I | op-k8sutil: deployment "rook-ceph-mon-a" did not change, nothing to update
2023-08-08 17:38:32.151084 I | op-mon: waiting for mon quorum with [b c a]
2023-08-08 17:38:32.173933 I | op-mon: mons running: [b c a]
2023-08-08 17:38:32.745103 I | op-mon: Monitors in quorum: [a c]
2023-08-08 17:38:32.745117 I | op-mon: mons created: 3
2023-08-08 17:38:33.343562 I | op-mon: waiting for mon quorum with [b c a]
2023-08-08 17:38:33.368968 I | op-mon: mons running: [b c a]
2023-08-08 17:38:33.935664 I | op-mon: Monitors in quorum: [a c]
2023-08-08 17:38:33.935870 I | op-config: applying ceph settings:
[global]
mon allow pool delete   = true
mon cluster log file    = 
mon allow pool size one = true
2023-08-08 17:38:48.936914 I | exec: exec timeout waiting for process ceph to return. Sending interrupt signal to the process
2023-08-08 17:38:48.948127 E | op-config: failed to run command ceph [config assimilate-conf -i /tmp/320294344 -o /tmp/320294344.out]
2023-08-08 17:38:48.948135 E | op-config: failed to apply ceph settings:
2023-08-08 17:38:48.976648 E | ceph-cluster-controller: failed to reconcile CephCluster "rook-ceph/rook-ceph". failed to reconcile cluster "rook-ceph": failed to configure local ceph cluster: failed to create cluster: failed to start ceph monitors: failed to set Rook and/or user-defined Ceph config options after forcefully updating the existing mons: failed to apply default Ceph configurations: failed to set all keys: failed to set ceph config in the centralized mon configuration database; output: Interrupted
Traceback (most recent call last):
  File "/usr/bin/ceph", line 1326, in <module>
    retval = main()
  File "/usr/bin/ceph", line 1272, in main
    outf.write(outbuf)
TypeError: a bytes-like object is required, not 'str': exit status 1
2023-08-08 17:39:04.058201 W | op-mon: failed to check mon health. failed to get mon quorum status: mon quorum status failed: exit status 1
2023-08-08 17:39:29.241988 E | ceph-cluster-controller: failed to get ceph status. failed to get status. . timed out: exit status 1
2023-08-08 17:39:44.345777 E | ceph-cluster-controller: failed to get ceph daemons versions. failed to run 'ceph versions'. . timed out: exit status 1
2023-08-08 17:40:00.314666 I | op-k8sutil: ROOK_WATCH_FOR_NODE_FAILURE="true" (default)
2023-08-08 17:40:04.142966 W | op-mon: failed to check mon health. failed to get mon quorum status: mon quorum status failed: exit status 1
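Since the operator log repeats the same few failures many times, a per-component tally can make the pattern easier to see. The sketch below assumes the `<date> <time> <level> | <component>: <message>` layout visible in the log above; `problem_counts` is a hypothetical helper, not something shipped with Rook.

```python
import re
from collections import Counter

# Example line: 2023-08-08 17:22:54.257902 W | op-mon: failed to check mon health. ...
LINE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+ ([A-Z]) \| ([\w-]+): ")


def problem_counts(log_text: str) -> Counter:
    """Count warning (W) and error (E) lines per (level, component)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE.match(line)
        # Lines without the prefix (tracebacks, config dumps) are skipped.
        if match and match.group(1) in ("W", "E"):
            counts[(match.group(1), match.group(2))] += 1
    return counts


# A few lines copied from the operator log above.
SAMPLE_LOG = """\
2023-08-08 17:22:39.134787 I | exec: exec timeout waiting for process ceph to return. Sending interrupt signal to the process
2023-08-08 17:22:39.145788 E | op-config: failed to run command ceph [config assimilate-conf -i /tmp/3988246856 -o /tmp/3988246856.out]
Traceback (most recent call last):
2023-08-08 17:22:54.257902 W | op-mon: failed to check mon health. failed to get mon quorum status: mon quorum status failed: exit status 1
2023-08-08 17:23:37.542835 E | ceph-cluster-controller: failed to get ceph status. failed to get status. . timed out: exit status 1
2023-08-08 17:23:52.648687 E | ceph-cluster-controller: failed to get ceph daemons versions. failed to run 'ceph versions'. . timed out: exit status 1
"""

if __name__ == "__main__":
    for (level, component), n in problem_counts(SAMPLE_LOG).most_common():
        print(f"{level} {component}: {n}")
```

In the log above, nearly all warnings and errors are timeouts of `ceph` commands issued by the operator, which matches the mons not holding a stable quorum.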

Has anybody faced a similar issue with Rocky Linux 9.2 or a similar alternative OS? As I mentioned at the beginning of the post, the exact same setup works fine on nodes running Rocky Linux 8.8 as the host OS.

I found a similar issue (a year old): #10110
I tried the solution provided there, i.e., setting the LimitNOFILE property to 1048576 instead of infinity.

# cat /etc/systemd/system/docker.service.d/LimitNOFILE.conf
[Service]
LimitNOFILE=1048576

I reloaded the systemd daemon, restarted the docker systemd service, and redeployed the Kubernetes cluster.

But even after this change, I end up with the same issue described above.
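A drop-in file only matters once the running process has actually picked it up, so it can be worth confirming the effective limit. The sketch below assumes a Linux host where `/proc/<pid>/limits` is readable; `max_open_files` is a hypothetical helper and the sample text is illustrative, not captured from my nodes.

```python
from typing import Optional, Tuple


def max_open_files(limits_text: str) -> Optional[Tuple[str, str]]:
    """Return (soft, hard) for 'Max open files' from /proc/<pid>/limits text.

    Values stay as strings because the kernel may print 'unlimited'.
    """
    label = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(label):
            soft, hard = line[len(label):].split()[:2]
            return soft, hard
    return None


# Illustrative sample in the usual /proc/<pid>/limits layout (not real data).
SAMPLE_LIMITS = """\
Limit                     Soft Limit           Hard Limit           Units
Max processes             unlimited            unlimited            processes
Max open files            1048576              1048576              files
"""

if __name__ == "__main__":
    print(max_open_files(SAMPLE_LIMITS))
```

On a node, the input would be the contents of `/proc/<pid>/limits` for the dockerd process or for a process inside the mon container, which shows whether the value from the drop-in is the one in effect.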

I would really appreciate any help pointing out what went wrong with the above setup.

Thanks in advance.

travisn · 2023-08-09T17:36:32Z

Maintainer

The mon restart seems to be caused by something outside the mon, most commonly the liveness probe. Does kubectl describe on the mon deployment show that the liveness probe failed, or give any other details?

debug 2023-08-08T16:52:18.926+0000 7f37811a9700 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
debug 2023-08-08T16:52:18.926+0000 7f37811a9700 -1 mon.b@1(leader) e3 *** Got Signal Terminated ***

Since the third mon is not restarting, and assuming all three are on the same platform, this would not seem to be specific to Rocky Linux 9.2.
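One way to check for probe failures is to look through the namespace events. The sketch below filters the JSON printed by `kubectl -n rook-ceph get events -o json`. It assumes the kubelet records failed probes as events with reason `Unhealthy` and a message containing "Liveness probe failed", which is the usual behavior. `liveness_failures` is a hypothetical helper and the sample events are made up for illustration.

```python
import json
from typing import List, Tuple


def liveness_failures(events_json: str) -> List[Tuple[str, int, str]]:
    """Return (object name, count, message) for liveness-probe failure events."""
    hits = []
    for event in json.loads(events_json).get("items", []):
        message = event.get("message") or ""
        if event.get("reason") == "Unhealthy" and "Liveness probe failed" in message:
            name = event.get("involvedObject", {}).get("name", "?")
            # 'count' can be missing or null, so fall back to 1.
            hits.append((name, event.get("count") or 1, message))
    return hits


# Made-up events in the shape of `kubectl get events -o json` output.
SAMPLE_EVENTS = json.dumps({
    "items": [
        {
            "reason": "Unhealthy",
            "message": "Liveness probe failed: (hypothetical message)",
            "count": 3,
            "involvedObject": {"kind": "Pod", "name": "rook-ceph-mon-b-6bdd8b646f-s5x4f"},
        },
        {
            "reason": "Pulled",
            "message": "Container image already present on machine",
            "count": 1,
            "involvedObject": {"kind": "Pod", "name": "rook-ceph-mon-c-84c874775f-hb5rj"},
        },
    ]
})

if __name__ == "__main__":
    for name, count, message in liveness_failures(SAMPLE_EVENTS):
        print(f"{name} (x{count}): {message}")
```

Events are typically kept only for a limited time (about an hour by default), so this is best run shortly after a restart. An empty result would suggest looking elsewhere for the source of the SIGTERM.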


zhangdeshuai1999 · 2024-03-20T09:14:41Z


I have the same problem. How is it going?
