Hello. I am trying to add custom fields to DCGM, but every field other than the defaults returns 0.
I tried modifying both the Python and the C++ examples here:
/usr/local/dcgm/bindings/DcgmReaderExample.py
/usr/local/dcgm/sdk_samples/c_src/field_value_sample/field_value_sample.cpp
I also looked at the documentation here: https://docs.nvidia.com/datacenter/dcgm/2.0/pdf/DCGM_User_Guide.pdf (section 4.1.4, Additional Customization).
dcgmi does show the custom fields:
dcgmi dmon -l | grep nvswitch_latency_histogram
nvswitch_latency_histogram_low_p00 SLL00 700
nvswitch_latency_histogram_med_p00 SLM00 701
nvswitch_latency_histogram_high_p00 SHL00 702
nvswitch_latency_histogram_max_p00 SLX00 703
nvswitch_latency_histogram_low_p01 SLL01 704
nvswitch_latency_histogram_med_p01 SLM01 705
nvswitch_latency_histogram_high_p01 SLH01 706
nvswitch_latency_histogram_max_p01 SLX01 707
...
dcgmi dmon -l | grep nvlink_bandwidth
nvlink_bandwidth_l0 NBWL0 440
nvlink_bandwidth_l1 NBWL1 441
nvlink_bandwidth_l2 NBWL2 442
nvlink_bandwidth_l3 NBWL3 443
nvlink_bandwidth_l4 NBWL4 444
nvlink_bandwidth_l5 NBWL5 445
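To cross-check which numeric ID belongs to which short tag, the listing above can be turned into a lookup table. This is only a minimal sketch that parses the text printed by `dcgmi dmon -l`; it does not call DCGM, and the helper name `parse_field_listing` is hypothetical.

```python
def parse_field_listing(text):
    """Map each short tag to (long_name, field_id) from `dcgmi dmon -l` style lines."""
    fields = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only lines shaped like: <long name> <short tag> <numeric field ID>.
        if len(parts) != 3 or not parts[2].isdigit():
            continue
        long_name, tag, field_id = parts
        fields[tag] = (long_name, int(field_id))
    return fields


# Lines copied from the listing above.
listing = """\
nvlink_bandwidth_l0 NBWL0 440
nvlink_bandwidth_l1 NBWL1 441
nvlink_bandwidth_l2 NBWL2 442
"""

fields = parse_field_listing(listing)
print(fields["NBWL0"])
```

The tags are the column headers that `dcgmi dmon -e` prints, so the table makes it easy to confirm that the IDs passed to `-e` match the columns being read.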
However, the value returned for these additional fields is always 0. Here’s the output of dcgmi during an allreduce test on 8 GPUs with a message size of 64 MB and 5000 iterations. The non-zero values below belong to the default fields: transmitted bytes, received bytes, and total NVLink bandwidth. Any other field I add always returns 0.
dcgmi dmon -e 1011,1012,449,440,441 -d 100
# Entity NVLTX NVLRX NBWLT NBWL0 NBWL1
Id MB/s^T MB/s^T MB/s^T
GPU 0 205082371557 205696841961 401614 0 0
GPU 1 205084045289 205084477338 401028 0 0
GPU 2 205092980007 205092399581 401062 0 0
GPU 3 205081561723 205080801023 401063 0 0
GPU 4 205098144886 205098765138 401190 0 0
GPU 5 205392418343 205085263573 401385 0 0
GPU 6 206086025936 205473712850 402350 0 0
GPU 7 205770384146 206069539258 402396 0 0
GPU 0 205053821100 205671712369 401158 0 0
GPU 1 205088325022 205088620733 400435 0 0
GPU 2 204991967373 205028901802 400522 0 0
GPU 3 204993510903 205034220882 400436 0 0
GPU 4 204919353085 204921515699 400405 0 0
GPU 5 205359401368 205075556464 400813 0 0
GPU 6 206017321416 205358999314 401594 0 0
GPU 7 205727189948 205992211811 402166 0 0
GPU 0 205056776709 205672825115 401232 0 0
GPU 1 205047062591 205034868008 400528 0 0
GPU 2 205034439289 205003228751 400382 0 0
…
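To confirm over a longer capture that the extra columns never leave 0, the saved dmon text can be scanned with a small script. This is a minimal sketch that assumes the layout shown above (a `#` header row with the tags, then `GPU <id>` rows with one value per tag). The function name `zero_columns` is hypothetical and nothing here calls DCGM.

```python
from collections import defaultdict


def zero_columns(dmon_text):
    """Return the header tags whose value is 0 in every sampled GPU row."""
    tags = []
    values = defaultdict(list)
    for line in dmon_text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "#":
            # Header row looks like: "# Entity NVLTX NVLRX ..."
            tags = parts[2:]
        elif parts[0] == "GPU" and len(parts) == 2 + len(tags):
            for tag, raw in zip(tags, parts[2:]):
                # Skip anything that is not a plain integer sample.
                if raw.isdigit():
                    values[tag].append(int(raw))
    return [t for t in tags if values[t] and all(v == 0 for v in values[t])]


# First rows copied from the output above.
sample = """\
# Entity NVLTX NVLRX NBWLT NBWL0 NBWL1
Id MB/s^T MB/s^T MB/s^T
GPU 0 205082371557 205696841961 401614 0 0
GPU 1 205084045289 205084477338 401028 0 0
"""

flagged = zero_columns(sample)
print(flagged)
```

On the rows shown, only the per-link columns NBWL0 and NBWL1 are flagged, which matches what I see by eye.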
Any input on what might be missing? Thank you.
System details:
dpkg -l | grep datacenter
ii datacenter-gpu-manager 1:2.0.10 amd64 NVIDIA® Datacenter GPU Management Tools
nvidia-smi
Mon Sep 14 18:14:49 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06 Driver Version: 450.51.06 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB On | 00000000:07:00.0 Off | 0 |
| N/A 28C P0 60W / 400W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
...