DCGM bad message size

We installed DCGM on our cluster to log GPU usage data for LSF jobs, but it is not working at the moment.

I noticed the following error messages in the DCGM log file:

2021-02-09 16:40:00.009 ERROR [29932:29938] Got bad message size 285212672. Closing connection. [/workspaces/dcgm-rel_dcgm_2_1-postmerge/common/transport/DcgmIpc.cpp:964] [DcgmIpcConnection::ReadMessages]

2021-02-09 16:40:00.009 ERROR [29932:29938] Got error Host engine connection invalid/disconnected from ReadMessages [/workspaces/dcgm-rel_dcgm_2_1-postmerge/common/transport/DcgmIpc.cpp:841] [DcgmIpc::ReadCB]

2021-02-09 16:40:00.009 ERROR [29932:29934] Unknown subcommand: 1 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/modules/core/DcgmModuleCore.cpp:82] [DcgmModuleCore::ProcessMessage]
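As a quick sanity check on the rejected size, here is a minimal Python sketch that prints the value from the first log line in hexadecimal and in both byte orders. It only inspects the number itself. The DCGM wire format is not described in this thread, so treating the value as a 32-bit length field is an assumption made for illustration.

```python
import struct

# Value copied from the "Got bad message size" log line.
bad_size = 285212672

# Hexadecimal view of the rejected size.
print(hex(bad_size))  # 0x11000000

# Assumption: treat the size as an unsigned 32-bit field and show
# how it would be laid out in each byte order.
little_endian = struct.pack("<I", bad_size)
big_endian = struct.pack(">I", bad_size)
print(little_endian.hex())  # 00000011
print(big_endian.hex())     # 11000000

# Reading the little-endian bytes back as big-endian gives a small number.
swapped = struct.unpack(">I", little_endian)[0]
print(swapped)  # 17
```

The value has a single non-zero byte (0x11). That may hint that the host engine read bytes that were never meant as a message length, for example from a client using a different protocol or version, but nothing in the logs above confirms this.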

Can you help me diagnose and fix this? Thanks!

Dear @jiang.dansha,
Could you please confirm whether you are using the DRIVE PX2 platform? If not, please post the issue in the relevant forum.

We have V100 GPUs, driver version 450.51.06, CUDA version 11.0.

Any progress? I have the same problem. I run DCGM 2.1.4 on the host and tried to run the dcgm-exporter Docker image nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.3.1-ubuntu18.04
with -r localhost:5555, and I got the same messages.

Dear @shovsj,
This forum is intended for queries related to the DRIVE PX2 platform. Please post your query in the relevant forum so that it gets attention.

Thanks. I googled the error messages and ran into this topic. I’m not using DRIVE PX2, so please disregard my question. Sorry, I’m new here. Could you suggest a DCGM-related forum, if there is one? Then I can create a new topic there.