Hi, I am running the Docker version of dcgm-exporter.
Right after starting the dcgm-exporter container, the profiling metrics are reported correctly.
After some period of time, however, I can no longer get the profiling metrics (DCGM_FI_PROF_*).
Specifically, the exporter reports the profiling metrics as zero, while other metrics such as DCGM_FI_DEV_POWER_USAGE are still reported correctly.
If I restart the container, the metrics are exported correctly again.
Here is an example:
```
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
# HELP DCGM_FI_DEV_POWER_USAGE Power draw (in W).
# TYPE DCGM_FI_DEV_POWER_USAGE gauge
# HELP DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION Total energy consumption since boot (in mJ).
# TYPE DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION counter
# HELP DCGM_FI_DEV_PCIE_REPLAY_COUNTER Total number of PCIe retries.
# TYPE DCGM_FI_DEV_PCIE_REPLAY_COUNTER counter
# HELP DCGM_FI_DEV_MEM_COPY_UTIL Memory utilization (in %).
# TYPE DCGM_FI_DEV_MEM_COPY_UTIL gauge
# HELP DCGM_FI_DEV_ENC_UTIL Encoder utilization (in %).
# TYPE DCGM_FI_DEV_ENC_UTIL gauge
# HELP DCGM_FI_DEV_DEC_UTIL Decoder utilization (in %).
# TYPE DCGM_FI_DEV_DEC_UTIL gauge
# HELP DCGM_FI_DEV_XID_ERRORS Value of the last XID error encountered.
# TYPE DCGM_FI_DEV_XID_ERRORS gauge
# HELP DCGM_FI_DEV_FB_FREE Framebuffer memory free (in MiB).
# TYPE DCGM_FI_DEV_FB_FREE gauge
# HELP DCGM_FI_DEV_FB_USED Framebuffer memory used (in MiB).
# TYPE DCGM_FI_DEV_FB_USED gauge
# HELP DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL Total number of NVLink bandwidth counters for all lanes.
# TYPE DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL counter
# HELP DCGM_FI_DEV_VGPU_LICENSE_STATUS vGPU License status
# TYPE DCGM_FI_DEV_VGPU_LICENSE_STATUS gauge
# HELP DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS Number of remapped rows for uncorrectable errors
# TYPE DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS counter
# HELP DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS Number of remapped rows for correctable errors
# TYPE DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS counter
# HELP DCGM_FI_DEV_ROW_REMAP_FAILURE Whether remapping of rows has failed
# TYPE DCGM_FI_DEV_ROW_REMAP_FAILURE gauge
# HELP DCGM_FI_PROF_GR_ENGINE_ACTIVE Ratio of time the graphics engine is active (in %).
# TYPE DCGM_FI_PROF_GR_ENGINE_ACTIVE gauge
# HELP DCGM_FI_PROF_SM_ACTIVE The ratio of cycles an SM has at least 1 warp assigned (in %).
# TYPE DCGM_FI_PROF_SM_ACTIVE gauge
# HELP DCGM_FI_PROF_SM_OCCUPANCY The ratio of number of warps resident on an SM (in %).
# TYPE DCGM_FI_PROF_SM_OCCUPANCY gauge
# HELP DCGM_FI_PROF_PIPE_TENSOR_ACTIVE Ratio of cycles the tensor (HMMA) pipe is active (in %).
# TYPE DCGM_FI_PROF_PIPE_TENSOR_ACTIVE gauge
# HELP DCGM_FI_PROF_DRAM_ACTIVE Ratio of cycles the device memory interface is active sending or receiving data (in %).
# TYPE DCGM_FI_PROF_DRAM_ACTIVE gauge
# HELP DCGM_FI_PROF_PIPE_FP64_ACTIVE Ratio of cycles the fp64 pipes are active (in %).
# TYPE DCGM_FI_PROF_PIPE_FP64_ACTIVE gauge
# HELP DCGM_FI_PROF_PIPE_FP32_ACTIVE Ratio of cycles the fp32 pipes are active (in %).
# TYPE DCGM_FI_PROF_PIPE_FP32_ACTIVE gauge
# HELP DCGM_FI_PROF_PIPE_FP16_ACTIVE Ratio of cycles the fp16 pipes are active (in %).
# TYPE DCGM_FI_PROF_PIPE_FP16_ACTIVE gauge
# HELP DCGM_FI_PROF_PCIE_TX_BYTES The number of bytes of active pcie tx data including both header and payload.
# TYPE DCGM_FI_PROF_PCIE_TX_BYTES counter
# HELP DCGM_FI_PROF_PCIE_RX_BYTES The number of bytes of active pcie rx data including both header and payload.
# TYPE DCGM_FI_PROF_PCIE_RX_BYTES counter
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 1380
DCGM_FI_DEV_MEM_CLOCK{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 877
DCGM_FI_DEV_MEMORY_TEMP{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 41
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 43
DCGM_FI_DEV_POWER_USAGE{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 92.562000
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 118893321950
DCGM_FI_DEV_PCIE_REPLAY_COUNTER{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0
DCGM_FI_DEV_MEM_COPY_UTIL{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 10
DCGM_FI_DEV_ENC_UTIL{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0
DCGM_FI_DEV_DEC_UTIL{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0
DCGM_FI_DEV_XID_ERRORS{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0
DCGM_FI_DEV_FB_FREE{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 9939
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 22571
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0
DCGM_FI_DEV_VGPU_LICENSE_STATUS{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0.000000
DCGM_FI_PROF_SM_ACTIVE{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0.000000
DCGM_FI_PROF_SM_OCCUPANCY{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0.000000
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0.000000
DCGM_FI_PROF_DRAM_ACTIVE{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0.000000
DCGM_FI_PROF_PIPE_FP64_ACTIVE{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0.000000
DCGM_FI_PROF_PIPE_FP32_ACTIVE{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0.000000
DCGM_FI_PROF_PIPE_FP16_ACTIVE{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0.000000
DCGM_FI_PROF_PCIE_TX_BYTES{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0
DCGM_FI_PROF_PCIE_RX_BYTES{gpu="0",UUID="GPU-45984919-9e67-f2ec-9b84-a96fefa525ff",device="nvidia0"} 0
DCGM_FI_DEV_SM_CLOCK{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 1380
DCGM_FI_DEV_MEM_CLOCK{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 877
DCGM_FI_DEV_MEMORY_TEMP{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 54
DCGM_FI_DEV_GPU_TEMP{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 56
DCGM_FI_DEV_POWER_USAGE{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 167.362000
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 129931554972
DCGM_FI_DEV_PCIE_REPLAY_COUNTER{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0
DCGM_FI_DEV_MEM_COPY_UTIL{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 22
DCGM_FI_DEV_ENC_UTIL{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0
DCGM_FI_DEV_DEC_UTIL{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0
DCGM_FI_DEV_XID_ERRORS{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0
DCGM_FI_DEV_FB_FREE{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 30332
DCGM_FI_DEV_FB_USED{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 2178
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0
DCGM_FI_DEV_VGPU_LICENSE_STATUS{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0.000000
DCGM_FI_PROF_SM_ACTIVE{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0.000000
DCGM_FI_PROF_SM_OCCUPANCY{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0.000000
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0.000000
DCGM_FI_PROF_DRAM_ACTIVE{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0.000000
DCGM_FI_PROF_PIPE_FP64_ACTIVE{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0.000000
DCGM_FI_PROF_PIPE_FP32_ACTIVE{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0.000000
DCGM_FI_PROF_PIPE_FP16_ACTIVE{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0.000000
DCGM_FI_PROF_PCIE_TX_BYTES{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0
DCGM_FI_PROF_PCIE_RX_BYTES{gpu="1",UUID="GPU-08abd178-8a91-05c5-6a52-1ab4a05bcdaf",device="nvidia1"} 0
```
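For monitoring purposes, the broken state can be detected by scraping the /metrics endpoint and counting how many DCGM_FI_PROF_* samples are stuck at zero. A minimal shell sketch (the port matches the `--address` flag from the docker command below; treating "all PROF samples zero" as the failure signal assumes the GPUs are known to be busy at the time):

```shell
#!/bin/sh
# count_zero_prof: reads Prometheus text format on stdin and prints the
# number of DCGM_FI_PROF_* samples whose value is zero.
count_zero_prof() {
  awk '/^DCGM_FI_PROF_/ && $NF == 0 { n++ } END { print n + 0 }'
}

# Against a live exporter (port taken from the docker run command below):
# curl -s http://localhost:31400/metrics | count_zero_prof
```

If this count equals the total number of PROF samples while `nvidia-smi` shows nonzero utilization, the exporter is in the broken state described above.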
As you can see, the GPUs are being utilized, but every DCGM_FI_PROF_* metric reads zero.
This is the output of nvidia-smi:
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05 Driver Version: 450.51.05 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:04:00.0 Off | 0 |
| N/A 43C P0 96W / 250W | 22571MiB / 32510MiB | 43% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE... Off | 00000000:1B:00.0 Off | 0 |
| N/A 56C P0 168W / 250W | 2178MiB / 32510MiB | 95% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
```
Here is the relevant portion of /var/log/nv-hostengine.log inside the dcgm-exporter container:
```
2021-05-18 05:28:26.755 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 05:28:26.945 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 0 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]
2021-05-18 05:28:26.945 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 05:28:26.945 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 1 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]
2021-05-18 05:28:26.945 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 05:28:27.055 ERROR [1:27] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:9892] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2021-05-18 05:28:27.136 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 0 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]
2021-05-18 05:28:27.136 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 05:28:27.136 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 1 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]
2021-05-18 05:28:27.136 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 05:28:27.326 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 0 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]
2021-05-18 05:28:27.326 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
2021-05-18 05:28:27.326 ERROR [1:28] [PerfWorks] NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed with 22 for deviceIndex 1 [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmLopConfig.cpp:489] [DcgmLopConfig::GetSamples]
2021-05-18 05:28:27.326 ERROR [1:28] Got error -37 from GetSamples() [/workspaces/dcgm-rel_dcgm_2_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1532] [DcgmModuleProfiling::ReadGpuMetrics]
```
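To pin down when profiling sampling started failing (before restarting anything), the timestamp of the first PerfWorks error can be pulled out of the log. A minimal sketch, reading the same nv-hostengine log quoted above:

```shell
#!/bin/sh
# first_prof_fail: reads an nv-hostengine log on stdin and prints the
# date and time of the first NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep
# failure, i.e. when profiling sampling first broke.
first_prof_fail() {
  awk '/NVPW_DCGM_PeriodicSampler_CPUTrigger_TriggerKeep failed/ { print $1, $2; exit }'
}

# Inside the container:
# first_prof_fail < /var/log/nv-hostengine.log
```

Correlating that timestamp with other events on the host (job starts, other profilers such as Nsight attaching, driver messages in dmesg) may show what triggers the failure.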
The image version I am using is:
```
nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.3.1-ubuntu18.04
```
(I also tried nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.2-ubuntu18.04;
the same thing happened, except that the profiling metrics were not exported at all, whereas the newer version reports them as zero.)
The OS is:
```
CentOS Linux release 7.9.2009 (Core)
Linux cl-platform-gpu02 3.10.0-1160.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
```
The NVIDIA driver and GPUs are:
```
450.51.05
GPU 0: Tesla V100-PCIE-32GB
GPU 1: Tesla V100-PCIE-32GB
```
This is the command I used to start the container:
```
docker run --gpus all --cap-add CAP_SYS_ADMIN --network host -v /NAS:/NAS -d nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.3.1-ubuntu18.04 --address 0.0.0.0:31400 -f /NAS/dcgm_exporter/dcp-metrics-included-all.csv
```
and /NAS/dcgm_exporter/dcp-metrics-included-all.csv contains:
```
# Format,,
# If line starts with a '#' it is considered a comment,,
# DCGM FIELD, Prometheus metric type, help message
# Clocks,,
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
# Temperature,,
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
# Power,,
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
# PCIE,,
# DCGM_FI_DEV_PCIE_TX_THROUGHPUT, counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
# DCGM_FI_DEV_PCIE_RX_THROUGHPUT, counter, Total number of bytes received through PCIe RX (in KB) via NVML.
DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.
# Utilization (the sample period varies depending on the product),,
# DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
DCGM_FI_DEV_ENC_UTIL, gauge, Encoder utilization (in %).
DCGM_FI_DEV_DEC_UTIL , gauge, Decoder utilization (in %).
# Errors and violations,,
DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.
# DCGM_FI_DEV_POWER_VIOLATION, counter, Throttling duration due to power constraints (in us).
# DCGM_FI_DEV_THERMAL_VIOLATION, counter, Throttling duration due to thermal constraints (in us).
# DCGM_FI_DEV_SYNC_BOOST_VIOLATION, counter, Throttling duration due to sync-boost constraints (in us).
# DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
# DCGM_FI_DEV_LOW_UTIL_VIOLATION, counter, Throttling duration due to low utilization (in us).
# DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).
# Memory usage,,
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
# ECC,,
# DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
# DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
# DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
# DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.
# Retired pages,,
# DCGM_FI_DEV_RETIRED_SBE, counter, Total number of retired pages due to single-bit errors.
# DCGM_FI_DEV_RETIRED_DBE, counter, Total number of retired pages due to double-bit errors.
# DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.
# NVLink,,
# DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
# DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
# DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL, counter, Total number of NVLink retries.
# DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes.
# DCGM_FI_DEV_NVLINK_BANDWIDTH_L0, counter, The number of bytes of active NVLink rx or tx data including both header and payload.
# VGPU License status,,
DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status
# Remapped rows,,
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for correctable errors
DCGM_FI_DEV_ROW_REMAP_FAILURE, gauge, Whether remapping of rows has failed
# DCP metrics,,
DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Ratio of time the graphics engine is active (in %).
DCGM_FI_PROF_SM_ACTIVE, gauge, The ratio of cycles an SM has at least 1 warp assigned (in %).
DCGM_FI_PROF_SM_OCCUPANCY, gauge, The ratio of number of warps resident on an SM (in %).
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %).
DCGM_FI_PROF_DRAM_ACTIVE, gauge, Ratio of cycles the device memory interface is active sending or receiving data (in %).
DCGM_FI_PROF_PIPE_FP64_ACTIVE, gauge, Ratio of cycles the fp64 pipes are active (in %).
DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, Ratio of cycles the fp32 pipes are active (in %).
DCGM_FI_PROF_PIPE_FP16_ACTIVE, gauge, Ratio of cycles the fp16 pipes are active (in %).
DCGM_FI_PROF_PCIE_TX_BYTES, counter, The number of bytes of active pcie tx data including both header and payload.
DCGM_FI_PROF_PCIE_RX_BYTES, counter, The number of bytes of active pcie rx data including both header and payload.
```
Is there something I missed, or anything else I should check?