DCGM does not export profile metrics after some period of time

shovsj · June 1, 2021, 4:38am

Hello, I already reported this issue to

query profiling metrics raises an exception

opened 05:30AM - 24 May 21 UTC

closed 10:06PM - 25 Aug 21 UTC

hello, I am new to DCGM, I would like to collect the profiling metrics, but there seems to be several problems.

I am using dcgm exporter to collect the profiling metrics, but unfortunately, there are some problems. https://github.com/NVIDIA/gpu-monitoring-tools/issues/189

to check dcgm is properly working, I tested using the following command, and I got some error messages

```
root@cl-platform-gpu01:/# dcgmi dmon -e 1004
# Entity  TENSO
      Id
Error setting watches. Result: The third-party Profiling module returned an unrecoverable error
```

I also used the profilertest, and it seems okay

```
root@cl-platform-gpu01:/# /usr/bin/dcgmproftester11 --no-dcgm-validation -t 1004 -d 10
Skipping CreateDcgmGroups() since DCGM validation is disabled
Skipping CreateDcgmGroups() since DCGM validation is disabled
Skipping CreateDcgmGroups() since DCGM validation is disabled
Skipping WatchFields() since DCGM validation is disabled
Skipping CreateDcgmGroups() since DCGM validation is disabled
Skipping CreateDcgmGroups() since DCGM validation is disabled
Worker 1:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (76460.9 gflops)
Worker 0:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (76553.6 gflops)
Worker 1:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (83514.9 gflops)
Worker 0:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (83709.5 gflops)
Worker 1:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (78960.7 gflops)
Worker 0:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (78831.1 gflops)
Worker 1:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (81628.5 gflops)
Worker 0:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (79509.4 gflops)
Worker 1:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (83713.7 gflops)
Worker 0:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (84599.7 gflops)
Worker 0:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (79126.2 gflops)
Worker 1:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (79630.3 gflops)
Worker 0:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (80746.4 gflops)
Worker 1:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (80022.6 gflops)
Worker 1:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (83400.0 gflops)
Worker 0:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (83205.9 gflops)
Worker 1:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (78649.8 gflops)
Worker 0:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (78736.6 gflops)
Worker 1:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (84357.4 gflops)
Worker 0:0[1004]: TensorEngineActive: generated ???, dcgm 0.000 (83102.6 gflops)
Worker 1:0[1004]: Message: Bus ID 00000000:1B:00.0 mapped to cuda device ID 1
DCGM CudaContext Init completed successfully.

CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR: 2048
CUDA_VISIBLE_DEVICES:
CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT: 80
CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR: 98304
CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR: 7
CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR: 0
CU_DEVICE_ATTRIBUTE_GLOBAL_MEMORY_BUS_WIDTH: 4096
CU_DEVICE_ATTRIBUTE_MEMORY_CLOCK_RATE: 877
Max Memory bandwidth: 898048000000 bytes (898.0 GiB)
CU_DEVICE_ATTRIBUTE_ECC_SUPPORT: true

Worker 0:0[1004]: Message: Bus ID 00000000:04:00.0 mapped to cuda device ID 0
DCGM CudaContext Init completed successfully.

CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR: 2048
CUDA_VISIBLE_DEVICES:
CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT: 80
CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR: 98304
CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR: 7
CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR: 0
CU_DEVICE_ATTRIBUTE_GLOBAL_MEMORY_BUS_WIDTH: 4096
CU_DEVICE_ATTRIBUTE_MEMORY_CLOCK_RATE: 877
Max Memory bandwidth: 898048000000 bytes (898.0 GiB)
CU_DEVICE_ATTRIBUTE_ECC_SUPPORT: true

Skipping UnwatchFields() since DCGM validation is disabled
```

the version I used is

```
root@cl-platform-gpu01:/# dcgmi --version
dcgmi version: 2.1.4
```

The os is

```
CentOS Linux release 7.9.2009 (Core)
Linux cl-platform-gpu02 3.10.0-1160.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
```

I am using the following docker image

```
nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.3.1-ubuntu18.04
```

the nvidia-driver and gpus are

```
450.51.05
GPU 0: Tesla V100-PCIE-32GB
GPU 1: Tesla V100-PCIE-32GB
```

What should I check to resolve this issue?

and

github.com/NVIDIA/gpu-monitoring-tools

exporter returns no profiling metrics after some period of time

opened 05:46AM - 18 May 21 UTC

shovsj

Hi guys, I am using docker version of the dcgm-exporter, when I just started the dcgm-exporter container, I can get the profiling metrics well. After some period of time, I cannot get the profiling metrics like DCGM_FI_PROF_*. actually, exporter prints the profiling metrics as zero, while other metrics like DCGM_FI_DEV_POWER_USAGE are printed well. If I restart the container, then the metrics are exported well.

Here is an example

```
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
# HELP DCGM_FI_DEV_POWER_USAGE Power draw (in W).
# TYPE DCGM_FI_DEV_POWER_USAGE gauge
# HELP DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION Total energy consumption since boot (in mJ).
# TYPE DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION counter
# HELP DCGM_FI_DEV_PCIE_REPLAY_COUNTER Total number of PCIe retries.
# TYPE DCGM_FI_DEV_PCIE_REPLAY_COUNTER counter
# HELP DCGM_FI_DEV_MEM_COPY_UTIL Memory utilization (in %).
# TYPE DCGM_FI_DEV_MEM_COPY_UTIL gauge
# HELP DCGM_FI_DEV_ENC_UTIL Encoder utilization (in %).
# TYPE DCGM_FI_DEV_ENC_UTIL gauge
# HELP DCGM_FI_DEV_DEC_UTIL Decoder utilization (in %).
# TYPE DCGM_FI_DEV_DEC_UTIL gauge
# HELP DCGM_FI_DEV_XID_ERRORS Value of the last XID error encountered.
# TYPE DCGM_FI_DEV_XID_ERRORS gauge
# HELP DCGM_FI_DEV_FB_FREE Framebuffer memory free (in MiB).
# TYPE DCGM_FI_DEV_FB_FREE gauge
# HELP DCGM_FI_DEV_FB_USED Framebuffer memory used (in MiB).
# TYPE DCGM_FI_DEV_FB_USED gauge
# HELP DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL Total number of NVLink bandwidth counters for all lanes.
# TYPE DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL counter
# HELP DCGM_FI_DEV_VGPU_LICENSE_STATUS vGPU License status
# TYPE DCGM_FI_DEV_VGPU_LICENSE_STATUS gauge
# HELP DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS Number of remapped rows for uncorrectable errors
# TYPE DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS counter
# HELP DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS Number of remapped rows for correctable errors
# TYPE DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS counter
# HELP DCGM_FI_DEV_ROW_REMAP_FAILURE Whether remapping of rows has failed
# TYPE DCGM_FI_DEV_ROW_REMAP_FAILURE gauge
# HELP DCGM_FI_PROF_GR_ENGINE_ACTIVE Ratio of time the graphics engine is active (in %).
# TYPE DCGM_FI_PROF_GR_ENGINE_ACTIVE gauge
# HELP DCGM_FI_PROF_SM_ACTIVE The ratio of cycles an SM has at least 1 warp assigned (in %).
# TYPE DCGM_FI_PROF_SM_ACTIVE gauge
# HELP DCGM_FI_PROF_SM_OCCUPANCY The ratio of number of warps resident on an SM (in %).
# TYPE DCGM_FI_PROF_SM_OCCUPANCY gauge
# HELP DCGM_FI_PROF_PIPE_TENSOR_ACTIVE Ratio of cycles the tensor (HMMA) pipe is active (in %).
# TYPE DCGM_FI_PROF_PIPE_TENSOR_ACTIVE gauge
# HELP DCGM_FI_PROF_DRAM_ACTIVE Ratio of cycles the device memory interface is active sending or receiving data (in %).
# TYPE DCGM_FI_PROF_DRAM_ACTIVE gauge
# HELP DCGM_FI_PROF_PIPE_FP64_ACTIVE Ratio of cycles the fp64 pipes are active (in %).
# TYPE DCGM_FI_PROF_PIPE_FP64_ACTIVE gauge
# HELP DCGM_FI_PROF_PIPE_FP32_ACTIVE Ratio of cycles the fp32 pipes are active (in %).
# TYPE DCGM_FI_PROF_PIPE_FP32_ACTIVE gauge
# HELP DCGM_FI_PROF_PIPE_FP16_ACTIVE Ratio of cycles the fp16 pipes are active (in %).
# TYPE DCGM_FI_PROF_PIPE_FP16_ACTIVE gauge
# HELP DCGM_FI_PROF_PCIE_TX_BYTES The number of bytes of active pcie tx data including both header and payload.
# TYPE DCGM_FI_PROF_PCIE_TX_BYTES counter
# HELP DCGM_FI_PROF_PCIE_RX_BYTES The number of bytes of active pcie rx data including both header and payload.
# TYPE DCGM_FI_PROF_PCIE_RX_BYTES counter
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 1380
DCGM_FI_DEV_MEM_CLOCK{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 877
DCGM_FI_DEV_MEMORY_TEMP{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 41
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 43
DCGM_FI_DEV_POWER_USAGE{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 92.562000
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 118893321950
DCGM_FI_DEV_PCIE_REPLAY_COUNTER{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0
DCGM_FI_DEV_MEM_COPY_UTIL{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 10
DCGM_FI_DEV_ENC_UTIL{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0
DCGM_FI_DEV_DEC_UTIL{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0
DCGM_FI_DEV_XID_ERRORS{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0
DCGM_FI_DEV_FB_FREE{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 9939
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 22571
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0
DCGM_FI_DEV_VGPU_LICENSE_STATUS{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0.000000
DCGM_FI_PROF_SM_ACTIVE{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0.000000
DCGM_FI_PROF_SM_OCCUPANCY{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0.000000
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0.000000
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0.000000
DCGM_FI_PROF_PIPE_FP64_ACTIVE{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0.000000
DCGM_FI_PROF_PIPE_FP32_ACTIVE{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0.000000
DCGM_FI_PROF_PIPE_FP16_ACTIVE{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0.000000
DCGM_FI_PROF_PCIE_TX_BYTES{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0
DCGM_FI_PROF_PCIE_RX_BYTES{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0
DCGM_FI_DEV_SM_CLOCK{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 1380
DCGM_FI_DEV_MEM_CLOCK{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 877
DCGM_FI_DEV_MEMORY_TEMP{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 54
DCGM_FI_DEV_GPU_TEMP{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 56
DCGM_FI_DEV_POWER_USAGE{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 167.362000
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 129931554972
DCGM_FI_DEV_PCIE_REPLAY_COUNTER{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0
DCGM_FI_DEV_MEM_COPY_UTIL{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 22
DCGM_FI_DEV_ENC_UTIL{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0
DCGM_FI_DEV_DEC_UTIL{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0
DCGM_FI_DEV_XID_ERRORS{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0
DCGM_FI_DEV_FB_FREE{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 30332
DCGM_FI_DEV_FB_USED{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 2178
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0
DCGM_FI_DEV_VGPU_LICENSE_STATUS{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0.000000
DCGM_FI_PROF_SM_ACTIVE{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0.000000
DCGM_FI_PROF_SM_OCCUPANCY{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0.000000
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0.000000
DCGM_FI_PROF_DRAM_ACTIVE{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0.000000
DCGM_FI_PROF_PIPE_FP64_ACTIVE{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0.000000
DCGM_FI_PROF_PIPE_FP32_ACTIVE{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0.000000
DCGM_FI_PROF_PIPE_FP16_ACTIVE{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0.000000
DCGM_FI_PROF_PCIE_TX_BYTES{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0
DCGM_FI_PROF_PCIE_RX_BYTES{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0
```

as you can see, gpu is now being utilized, but DCGM_FI_PROF_* gives zero.

this is the result of nvidia-smi

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:04:00.0 Off |                    0 |
| N/A   43C    P0    96W / 250W |  22571MiB / 32510MiB |     43%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:1B:00.0 Off |                    0 |
| N/A   56C    P0   168W / 250W |   2178MiB / 32510MiB |     95%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
```

Here is the log for the /var/log/nv-hostengine.log in the dcgm-exporter container

```
2021-05-18 05:28:26.755 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 05:28:26.945 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 0 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]
2021-05-18 05:28:26.945 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 05:28:26.945 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 1 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]
2021-05-18 05:28:26.945 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 05:28:27.055 ERROR [1:27] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:9892] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2021-05-18 05:28:27.136 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 0 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]
2021-05-18 05:28:27.136 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 05:28:27.136 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 1 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]
2021-05-18 05:28:27.136 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 05:28:27.326 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 0 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]
2021-05-18 05:28:27.326 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 05:28:27.326 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 1 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]
2021-05-18 05:28:27.326 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
```

the current version I used is the followings:

```
nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.3.1-ubuntu18.04
```

(actually I also used nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.2-ubuntu18.04 the same thing happened except the profiling metrics are not exported, while newer version prints zero)

The os is

```
CentOS Linux release 7.9.2009 (Core)
Linux cl-platform-gpu02 3.10.0-1160.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
```

the nvidia-driver and gpus are

```
450.51.05
GPU 0: Tesla V100-PCIE-32GB
GPU 1: Tesla V100-PCIE-32GB
```

This is the command I used

```
docker run --gpus all --cap-add CAP_SYS_ADMIN --network host -v /NAS:/NAS -d nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.3.1-ubuntu18.04 --address 0.0.0.0:31400 -f /NAS/dcgm_exporter/dcp-metrics-included-all.csv --address 0.0.0.0:31400
```

and /NAS/dcgm_exporter/dcp-metrics-included-all.csv contains

```
# Format,,
# If line starts with a '#' it is considered a comment,,
# DCGM FIELD, Prometheus metric type, help message

# Clocks,,
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

# Temperature,,
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).

# Power,,
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).

# PCIE,,
# DCGM_FI_DEV_PCIE_TX_THROUGHPUT, counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
# DCGM_FI_DEV_PCIE_RX_THROUGHPUT, counter, Total number of bytes received through PCIe RX (in KB) via NVML.
DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.

# Utilization (the sample period varies depending on the product),,
# DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
DCGM_FI_DEV_ENC_UTIL, gauge, Encoder utilization (in %).
DCGM_FI_DEV_DEC_UTIL , gauge, Decoder utilization (in %).

# Errors and violations,,
DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.
# DCGM_FI_DEV_POWER_VIOLATION, counter, Throttling duration due to power constraints (in us).
# DCGM_FI_DEV_THERMAL_VIOLATION, counter, Throttling duration due to thermal constraints (in us).
# DCGM_FI_DEV_SYNC_BOOST_VIOLATION, counter, Throttling duration due to sync-boost constraints (in us).
# DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
# DCGM_FI_DEV_LOW_UTIL_VIOLATION, counter, Throttling duration due to low utilization (in us).
# DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).

# Memory usage,,
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).

# ECC,,
# DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
# DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
# DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
# DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.

# Retired pages,,
# DCGM_FI_DEV_RETIRED_SBE, counter, Total number of retired pages due to single-bit errors.
# DCGM_FI_DEV_RETIRED_DBE, counter, Total number of retired pages due to double-bit errors.
# DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.

# NVLink,,
# DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
# DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
# DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL, counter, Total number of NVLink retries.
# DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes.
# DCGM_FI_DEV_NVLINK_BANDWIDTH_L0, counter, The number of bytes of active NVLink rx or tx data including both header and payload.

# VGPU License status,,
DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status

# Remapped rows,,
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for correctable errors
DCGM_FI_DEV_ROW_REMAP_FAILURE, gauge, Whether remapping of rows has failed

# DCP metrics,,
DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Ratio of time the graphics engine is active (in %).
DCGM_FI_PROF_SM_ACTIVE, gauge, The ratio of cycles an SM has at least 1 warp assigned (in %).
DCGM_FI_PROF_SM_OCCUPANCY, gauge, The ratio of number of warps resident on an SM (in %).
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %).
DCGM_FI_PROF_DRAM_ACTIVE, gauge, Ratio of cycles the device memory interface is active sending or receiving data (in %).
DCGM_FI_PROF_PIPE_FP64_ACTIVE, gauge, Ratio of cycles the fp64 pipes are active (in %).
DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, Ratio of cycles the fp32 pipes are active (in %).
DCGM_FI_PROF_PIPE_FP16_ACTIVE, gauge, Ratio of cycles the fp16 pipes are active (in %).
DCGM_FI_PROF_PCIE_TX_BYTES, counter, The number of bytes of active pcie tx data including both header and payload.
DCGM_FI_PROF_PCIE_RX_BYTES, counter, The number of bytes of active pcie rx data including both header and payload.
```

Should I do something that I missed?

but I did not get any response, so I am creating this topic here.

I would like to build a monitoring system based on DCGM and dcgm-exporter, but there is a problem with the profiling metrics. I was able to install DCGM and dcgm-exporter properly, both with Docker and directly on the host.

But whenever I try to collect the profiling metrics, I get the following error messages:

2021-05-18 05:28:26.755 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 05:28:26.945 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 0 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]
2021-05-18 05:28:26.945 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 05:28:26.945 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 1 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]
2021-05-18 05:28:26.945 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 05:28:27.055 ERROR [1:27] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:9892] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2021-05-18 05:28:27.136 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 0 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]
2021-05-18 05:28:27.136 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 05:28:27.136 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 1 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]
2021-05-18 05:28:27.136 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 05:28:27.326 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 0 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]
2021-05-18 05:28:27.326 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 05:28:27.326 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 1 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]
2021-05-18 05:28:27.326 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
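
To see how often each error shows up and on which device, the log can be tallied with a short script. This is only a sketch: the function name `summarize_profiling_errors` is made up for illustration, and the two patterns are copied from the log lines above, so they may not match the messages of other DCGM versions.

```python
import re
from collections import Counter

# Patterns copied from the nv-hostengine log lines quoted above.
PERFWORKS_RE = re.compile(
    r"NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with (\d+) "
    r"for deviceIndex (\d+)"
)
GETSAMPLES_RE = re.compile(r"Got error (-?\d+) from GetSamples\(\)")


def summarize_profiling_errors(lines):
    """Count profiling-module errors in an iterable of log lines.

    Returns two Counters:
      perfworks:  (status, device_index) -> number of TriggerKeep failures
      getsamples: error_code -> number of GetSamples() errors
    Lines that match neither pattern (e.g. the nvmlSt line) are ignored.
    """
    perfworks = Counter()
    getsamples = Counter()
    for line in lines:
        match = PERFWORKS_RE.search(line)
        if match:
            perfworks[(int(match.group(1)), int(match.group(2)))] += 1
            continue
        match = GETSAMPLES_RE.search(line)
        if match:
            getsamples[int(match.group(1))] += 1
    return perfworks, getsamples
```

It can be fed an open file object for `/var/log/nv-hostengine.log` (the path inside the dcgm-exporter container), since iterating a file yields its lines.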

So I tried using dcgmi directly instead of dcgm-exporter,
with the following command:

dcgmi dmon -e 155,1001,1004

I got the metrics at first, but after a while the profiling metrics dropped to zero.

As you can see in the image, a training job is running, so power usage is more than 100W, but the profiling metrics are zero after some period.
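
The same symptom can be checked automatically against exporter output like the example in the referenced issue. A minimal sketch, assuming Prometheus text lines of the form `NAME{gpu="N",...} VALUE`; the helper name `find_stalled_gpus` and the 100 W default threshold are assumptions for illustration (the threshold only mirrors the power figure mentioned above), not anything DCGM defines.

```python
import re
from collections import defaultdict

# One sample line: metric name, label set in braces, then the value.
SAMPLE_RE = re.compile(r'^(DCGM_FI_\w+)\{([^}]*)\}\s+(\S+)\s*$')
GPU_LABEL_RE = re.compile(r'gpu="([^"]+)"')


def find_stalled_gpus(exposition_text, power_threshold_w=100.0):
    """Return the sorted GPU ids whose power draw exceeds the threshold
    while every DCGM_FI_PROF_* sample for that GPU is zero."""
    power = {}                # gpu id -> DCGM_FI_DEV_POWER_USAGE in W
    prof = defaultdict(list)  # gpu id -> all DCGM_FI_PROF_* values
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        sample = SAMPLE_RE.match(line)
        if not sample:
            continue
        name, labels, value = sample.groups()
        gpu = GPU_LABEL_RE.search(labels)
        if not gpu:
            continue
        try:
            number = float(value)
        except ValueError:
            continue
        if name == "DCGM_FI_DEV_POWER_USAGE":
            power[gpu.group(1)] = number
        elif name.startswith("DCGM_FI_PROF_"):
            prof[gpu.group(1)].append(number)
    return sorted(
        gpu_id
        for gpu_id, watts in power.items()
        if watts > power_threshold_w
        and prof.get(gpu_id)
        and all(v == 0.0 for v in prof[gpu_id])
    )
```

The text would come from the exporter's HTTP endpoint (the docker command in the referenced issue binds it to port 31400; the `/metrics` path is an assumption here). Power draw is only a rough proxy for load, so a hit is a hint to look at that GPU rather than proof that profiling has stalled.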

As I mentioned in the referenced issue, if I reconnect to nv-hostengine using the same command,
I get the metrics properly again for a while.

Is this a bug, or am I missing a step?