We installed DCGM on our cluster to write GPU usage data to the log file for LSF jobs, but it is not working at the moment.
I noticed the following error messages in the DCGM log file:
2021-02-09 16:40:00.009 ERROR [29932:29938] Got bad message size 285212672. Closing connection. [/workspaces/dcgm-rel_dcgm_2_1-postmerge/common/transport/DcgmIpc.cpp:964] [DcgmIpcConnection::ReadMessages]
2021-02-09 16:40:00.009 ERROR [29932:29938] Got error Host engine connection invalid/disconnected from ReadMessages [/workspaces/dcgm-rel_dcgm_2_1-postmerge/common/transport/DcgmIpc.cpp:841] [DcgmIpc::ReadCB]
2021-02-09 16:40:00.009 ERROR [29932:29934] Unknown subcommand: 1 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/modules/core/DcgmModuleCore.cpp:82] [DcgmModuleCore::ProcessMessage]
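One observation on the first error, in case it helps: the rejected size 285212672 is a suspiciously round number in hexadecimal. The short Python check below is only a sketch of that arithmetic; the variable names are mine, and the idea that the four bytes might be a small value read in the opposite byte order is an assumption, not something the log confirms.

```python
import struct

# Value reported in the "Got bad message size" log line.
bad_size = 285212672

# Show the value in hexadecimal.
hex_size = hex(bad_size)
print(hex_size)

# Assumption: reinterpret the same four bytes in the opposite byte order
# to see whether the value becomes something small and plausible.
swapped = struct.unpack(">I", struct.pack("<I", bad_size))[0]
print(swapped)
```

This prints 0x11000000 and then 17. My guess is that whatever connected to the host engine sent a header in a format it did not expect, for example a client built against a different DCGM version, but I have not verified this.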
Can you help me diagnose and fix this? Thanks!