DCGM stops exporting profiling metrics after some period of time

Hello, I have already reported this issue in two other places but did not get any response, so I am creating this topic here.

I would like to build a monitoring system based on DCGM and DCGM exporter, but there is a problem with the profiling metrics. I was able to install DCGM and DCGM exporter properly, both with Docker and directly on the host.

But whenever I try to collect the profiling metrics, I get the following error messages:

2021-05-18 05:28:26.755 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 05:28:26.945 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 0 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]
2021-05-18 05:28:26.945 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 05:28:26.945 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 1 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]
2021-05-18 05:28:26.945 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 05:28:27.055 ERROR [1:27] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:9892] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2021-05-18 05:28:27.136 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 0 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]
2021-05-18 05:28:27.136 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 05:28:27.136 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 1 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]
2021-05-18 05:28:27.136 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 05:28:27.326 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 0 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]
2021-05-18 05:28:27.326 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 05:28:27.326 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 1 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]
2021-05-18 05:28:27.326 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
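To see at a glance which functions and which GPUs are failing, the log can be tallied with a small script. This is only a rough sketch: it assumes every line ends with a `[file:line] [Class::Method]` pair like the lines above, and the names `summarize`, `LOG_TAIL` and `DEVICE` are made up for this example.

```python
import re
from collections import Counter

# Trailing "[source_file:line] [Class::Method]" pair, as seen in the log above.
LOG_TAIL = re.compile(r"\[(?P<file>[^\]]+):(?P<line>\d+)\] \[(?P<func>[^\]]+)\]\s*$")
# Optional "deviceIndex N" fragment that appears in the PerfWorks lines.
DEVICE = re.compile(r"deviceIndex (\d+)")


def summarize(log_lines):
    """Count log lines per (reporting function, device index or None)."""
    counts = Counter()
    for line in log_lines:
        tail = LOG_TAIL.search(line)
        if tail is None:
            continue  # not in the assumed format, skip it
        dev = DEVICE.search(line)
        device = int(dev.group(1)) if dev else None
        counts[(tail.group("func"), device)] += 1
    return counts


# Two lines copied from the log above.
lines = [
    "2021-05-18 05:28:26.755 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]",
    "2021-05-18 05:28:26.945 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 0 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]",
]

for (func, device), n in sorted(summarize(lines).items(), key=str):
    print(func, device, n)
```

Running this over the full log shows whether the `TriggerKeep` failures hit every device index or only some of them.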

So I tried using dcgmi directly instead of DCGM exporter and entered the following command:

dcgmi dmon -e 155,1001,1004

I could get the metrics for a while, but after some time the profiling metrics dropped to zero.


As you can see in the image, a training job is running, so power usage is above 100 W, but the profiling metrics are zero after some period.
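The condition described here (high power draw while the profiling fields read zero) can be checked automatically on captured `dcgmi dmon` output. The sketch below is an illustration only. It assumes the last three whitespace-separated columns of each data row are fields 155, 1001 and 1004 in that order, which may not match every DCGM version's layout. The function name `find_stalled_samples` and the sample rows are invented for this example and are not taken from the screenshot.

```python
def find_stalled_samples(lines, power_threshold=100.0):
    """Return (line_index, power) for rows where power is above the
    threshold but both profiling columns are exactly zero."""
    stalled = []
    for index, line in enumerate(lines):
        tokens = line.split()
        if line.lstrip().startswith("#") or len(tokens) < 3:
            continue  # header or too short to be a data row
        try:
            # Assumed column order: power (155), 1001, 1004.
            power, prof_a, prof_b = (float(t) for t in tokens[-3:])
        except ValueError:
            continue  # non-numeric values such as N/A
        if power > power_threshold and prof_a == 0.0 and prof_b == 0.0:
            stalled.append((index, power))
    return stalled


# Invented sample rows in the assumed layout.
sample = [
    "#Entity   POWER  GRACT  TENSO",
    "ID",
    "GPU 0  152.3  0.912  0.301",
    "GPU 0  149.8  0.000  0.000",
    "GPU 0  25.0   0.000  0.000",
]

for index, power in find_stalled_samples(sample):
    print(f"row {index}: power {power} W but profiling metrics are zero")
```

The 100 W threshold mirrors the observation above. An idle GPU legitimately reports zero activity, so the low-power row is not flagged.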

As I mentioned in the referenced issue, if I reconnect to nv-hostengine using the same command, I get the metrics properly again for a while.

Is this a bug, or am I missing something?