Custom fields added to DCGM return 0

Hello. I am trying to add custom fields to DCGM, but every field I add beyond the defaults returns 0.

I tried modifying both the Python and the C++ examples here (roughly as sketched below):
/usr/local/dcgm/bindings/DcgmReaderExample.py
/usr/local/dcgm/sdk_samples/c_src/field_value_sample/field_value_sample.cpp
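This is roughly what I changed in the Python example, a simplified sketch assuming DcgmReader's fieldIds/updateFrequency arguments and its GetLatestGpuValuesAsFieldIdDict() helper behave as in the shipped bindings; the field IDs match the dcgmi dmon command further down:

from DcgmReader import DcgmReader
import time

# Same field IDs as in the dcgmi dmon command below:
# 1011 = NVLink TX bytes, 1012 = NVLink RX bytes, 449 = NVLink total bandwidth,
# 440/441 = NVLink bandwidth L0/L1 (the ones that come back as 0)
field_ids = [1011, 1012, 449, 440, 441]

# updateFrequency is in microseconds; 100000 us = 100 ms to match "-d 100"
dr = DcgmReader(fieldIds=field_ids, updateFrequency=100000)

for _ in range(10):
    # Returns {gpuId: {fieldId: latest value}} for the watched fields
    data = dr.GetLatestGpuValuesAsFieldIdDict()
    for gpu_id, values in data.items():
        print(gpu_id, values)
    time.sleep(1)

The C++ sample was modified the same way, by adding the extra field IDs to the field group it watches.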

I also looked at the documentation here: https://docs.nvidia.com/datacenter/dcgm/2.0/pdf/DCGM_User_Guide.pdf (section 4.1.4, Additional Customization).

dcgmi does show the custom fields:

dcgmi dmon -l | grep nvswitch_latency_histogram
nvswitch_latency_histogram_low_p00                     SLL00            700
nvswitch_latency_histogram_med_p00                     SLM00            701
nvswitch_latency_histogram_high_p00                    SHL00            702
nvswitch_latency_histogram_max_p00                     SLX00            703
nvswitch_latency_histogram_low_p01                     SLL01            704
nvswitch_latency_histogram_med_p01                     SLM01            705
nvswitch_latency_histogram_high_p01                    SLH01            706
nvswitch_latency_histogram_max_p01                     SLX01            707
...

dcgmi dmon -l | grep nvlink_bandwidth
nvlink_bandwidth_l0                                    NBWL0            440
nvlink_bandwidth_l1                                    NBWL1            441
nvlink_bandwidth_l2                                    NBWL2            442
nvlink_bandwidth_l3                                    NBWL3            443
nvlink_bandwidth_l4                                    NBWL4            444
nvlink_bandwidth_l5                                    NBWL5            445

But the value returned for these additional fields is always 0. Here is the output of dcgmi during an allreduce test on 8 GPUs with a 64 MB message size and 5000 iterations. The non-zero values below are for the default fields (NVLink transmitted bytes, NVLink received bytes, and NVLink total bandwidth); any other field I add always returns 0.

dcgmi dmon -e 1011,1012,449,440,441 -d 100
# Entity NVLTX NVLRX NBWLT NBWL0 NBWL1
Id MB/s^T MB/s^T MB/s^T
GPU 0 205082371557 205696841961 401614 0 0
GPU 1 205084045289 205084477338 401028 0 0
GPU 2 205092980007 205092399581 401062 0 0
GPU 3 205081561723 205080801023 401063 0 0
GPU 4 205098144886 205098765138 401190 0 0
GPU 5 205392418343 205085263573 401385 0 0
GPU 6 206086025936 205473712850 402350 0 0
GPU 7 205770384146 206069539258 402396 0 0
GPU 0 205053821100 205671712369 401158 0 0
GPU 1 205088325022 205088620733 400435 0 0
GPU 2 204991967373 205028901802 400522 0 0
GPU 3 204993510903 205034220882 400436 0 0
GPU 4 204919353085 204921515699 400405 0 0
GPU 5 205359401368 205075556464 400813 0 0
GPU 6 206017321416 205358999314 401594 0 0
GPU 7 205727189948 205992211811 402166 0 0
GPU 0 205056776709 205672825115 401232 0 0
GPU 1 205047062591 205034868008 400528 0 0
GPU 2 205034439289 205003228751 400382 0 0

Any input on what might be missing? Thank you.

System details:

dpkg -l | grep datacenter
ii  datacenter-gpu-manager                                      1:2.0.10                                        amd64        NVIDIA® Datacenter GPU Management Tools

nvidia-smi
Mon Sep 14 18:14:49 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      On   | 00000000:07:00.0 Off |                    0 |
| N/A   28C    P0    60W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
...