Clagd won't start

Hey guys!

I have an issue that I can't figure out yet.

We had a pair of Spectrum MSN3700 switches connected to each other with MLAG.

After a problem with one of the two switches, which required a complete reinstallation, the first one can no longer start the clagd service.

I got the following logs from journalctl:

Aug 11 10:11:21 cumulus systemd[1]: Starting Cumulus Linux Multi-Chassis LACP Bonding Daemon…
Aug 11 10:11:21 cumulus clagd[3492775]: Clag Initializing
Aug 11 10:11:21 cumulus clagd[3492775]: Cleanup is executing.
Aug 11 10:11:21 cumulus clagd[3492850]: RTNETLINK answers: No such device
Aug 11 10:11:21 cumulus clagd[3492775]: Cleanup is finished
Aug 11 10:11:22 cumulus clagd[3492774]: Beginning execution of clagd version 1.4.0
Aug 11 10:11:22 cumulus clagd[3492774]: Invoked with: /usr/sbin/clagd --daemon linklocal peerlink.4094 44:38:39:BE:EF:BA --priority 32768 --backupIp 172.16.20.252 --initDelay 180 --debug 0xfffffffff
Aug 11 10:11:22 cumulus clagd[3492774]: macAddr = 44:38:39:be:ef:ba
Aug 11 10:11:23 cumulus clagd[3492774]: Allowing duplicate LACP partner MACs
Aug 11 10:11:23 cumulus clagd[3492774]: Role is now secondary
Aug 11 10:11:24 cumulus clagd[3492774]: Thread to receive from CSU Manager – Started
Aug 11 10:11:31 cumulus clagd[3492774]: [Thread-1] -----Thread 140398500529920 "CS Manager PUB/SUB" hangs -----
  File "/usr/lib/python3.7/threading.py", line 885, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3/dist-packages/clag/clagdcsu.py", line 266, in handlePubSubMesgs
    self.csuLock.acquire()
Aug 11 10:11:31 cumulus clagd[3492774]: [Thread-1] ---------Thread 140398577858368 "MainThread" hangs ---------
  File "/usr/sbin/clagd", line 7119, in <module>
    main()
  File "/usr/sbin/clagd", line 7067, in main
    ClagRun(nlm)
  File "/usr/sbin/clagd", line 6993, in ClagRun
    csu_client = clagdcsu.CSUClient(Log, Intf, Parser)
  File "/usr/lib/python3/dist-packages/clag/clagdcsu.py", line 55, in __init__
    self.sendLoadCompleteMsg()
  File "/usr/lib/python3/dist-packages/clag/clagdcsu.py", line 96, in sendLoadCompleteMsg
    resp = self.comm_socket.recv()
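
To make the hang dump above easier to read, here is a minimal sketch that pulls out the innermost frame of each thread reported as hanging. It is not part of clagd; the function name `innermost_frames` and both regular expressions are my own assumptions about the log layout shown in this post, and they accept straight or curly quotes because forum software tends to rewrite them.

```python
import re

# Header clagd prints for each hung thread, e.g. -----Thread 1403... "MainThread" hangs -----
HANG_RE = re.compile(r'-+\s*Thread\s+(\d+)\s+["“](.+?)["”]\s+hangs')
# Traceback frame header, e.g. File "/usr/sbin/clagd", line 7067, in main
FRAME_RE = re.compile(r'File\s+["“](.+?)["”],\s+line\s+(\d+)')


def innermost_frames(log_text):
    """Return {thread_name: (file, line_number, source_line)} for each hung thread."""
    result = {}
    current = None        # name of the thread whose traceback is being read
    pending_frame = None  # (file, line_number) waiting for its source line
    for raw in log_text.splitlines():
        line = raw.strip()
        hang = HANG_RE.search(line)
        if hang:
            current = hang.group(2)
            pending_frame = None
            continue
        if current is None:
            continue
        frame = FRAME_RE.search(line)
        if frame:
            pending_frame = (frame.group(1), int(frame.group(2)))
        elif pending_frame and line:
            # The line after a frame header is the source statement.
            # Later frames overwrite earlier ones, so the last one wins.
            result[current] = (pending_frame[0], pending_frame[1], line)
            pending_frame = None
    return result
```

Run against the dump in this post, it reports the main thread blocked in `self.comm_socket.recv()` (clagdcsu.py line 96) and the "CS Manager PUB/SUB" thread waiting on `self.csuLock.acquire()` (line 266). Read that way, clagd appears to be waiting for a reply to its load-complete message, which suggests looking at whatever sits on the other end of that socket.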

Has anyone seen this? I'm on Cumulus Linux 5.6. How can I pinpoint where the issue comes from?

During my investigation I was able to run clagd from the CLI without issues, but every time I start it through systemd it fails.
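
Since the daemon behaves differently from the CLI and under systemd, one thing that can be compared is the environment of the two processes. The sketch below is only an illustration under that assumption: the helper names are made up, the environment may well not be the cause, and it says nothing about other differences such as start ordering or the arguments in the unit file (the journal shows `--initDelay 180` and `--debug 0xfffffffff`, so the manual run should use the same ones to be comparable).

```python
def parse_environ(raw):
    """Parse the NUL-separated bytes of /proc/<pid>/environ into a dict."""
    env = {}
    for entry in raw.split(b"\0"):
        if not entry:
            continue
        key, _, value = entry.partition(b"=")
        env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def read_environ(pid):
    """Read the environment of a running process (needs root for other users' processes)."""
    with open(f"/proc/{pid}/environ", "rb") as fh:
        return parse_environ(fh.read())


def diff_env(working, failing):
    """Return {name: (working_value, failing_value)} for every variable that differs."""
    diff = {}
    for key in sorted(set(working) | set(failing)):
        a, b = working.get(key), failing.get(key)
        if a != b:
            diff[key] = (a, b)
    return diff
```

With a clagd started by hand and one started by systemd, `diff_env(read_environ(cli_pid), read_environ(systemd_pid))` lists the variables that are missing or different; `None` on one side means the variable is not set there.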

I haven't seen this problem before. Assuming the configs are fine, is there any user config or issue with the csmgrd daemon, or with the switch in general (no other issues at boot)?
Is there anything more in clagd.log? You have clagd debug enabled, so there is likely a lot of output there. You may want to turn the debug level down or off and check again.
I think this probably needs a deeper investigation, so it might be best to collect a cl-support and open a Support case.
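
For the csmgrd question, a quick first check is whether the related units are running at all. This is a rough sketch using `systemctl is-active`; the unit names `csmgrd.service` and `clagd.service` are assumptions on my part, so confirm them with `systemctl list-units` on the switch first.

```python
import subprocess

# Assumed unit names; confirm the real ones with `systemctl list-units`.
UNITS = ["csmgrd.service", "clagd.service"]


def unit_state(unit, run=subprocess.run):
    """Return the state printed by `systemctl is-active <unit>` (e.g. active, failed)."""
    proc = run(["systemctl", "is-active", unit], capture_output=True, text=True)
    return proc.stdout.strip() or "unknown"


def report(units=UNITS, run=subprocess.run):
    """Map each unit name to its current state."""
    return {unit: unit_state(unit, run) for unit in units}


if __name__ == "__main__":
    for unit, state in report().items():
        print(f"{unit}: {state}")
```

If csmgrd shows anything other than `active`, its own journal (`journalctl -u` with the unit name) would be the next place to look before opening the case.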