Nvidia-smi: gpm-metrics not populated

When running ‘nvidia-smi dmon --gpm-metrics=2,3,4,5’ (other metric selections show the same issue),
each GPM metric is reported as ‘0’ for the first update and ‘-’ thereafter.

System:
Ubuntu 20.04.5 LTS (GNU/Linux 5.15.0-1013-oracle x86_64)
NVIDIA A100-SXM4-80GB
NVIDIA-SMI 525.60.13

and same result on Windows:
Windows 10 Pro 64-bit
NVIDIA GeForce RTX 3090
NVIDIA-SMI 526.47

Is a specific version of nvml / debug mode / system configuration required for accessing these stats?

Since nvidia-smi wraps NVML, I wrote a quick C++ NVML program to test GPM-metrics queries. ‘nvmlGpmQueryDeviceSupport’ returns isSupportedDevice=0 for both cards listed above (both Ampere).
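For reference, the check looks roughly like this (a minimal sketch rather than the exact program; it just loops over the visible GPUs and prints what nvmlGpmQueryDeviceSupport reports):

// Minimal sketch: query GPM support for every visible GPU and print the result.
// Build with e.g.  g++ gpm_check.cpp -o gpm_check -lnvidia-ml
#include <cstdio>
#include <nvml.h>

int main()
{
    nvmlReturn_t rc = nvmlInit_v2();
    if (rc != NVML_SUCCESS) {
        std::fprintf(stderr, "nvmlInit_v2 failed: %s\n", nvmlErrorString(rc));
        return 1;
    }

    unsigned int count = 0;
    nvmlDeviceGetCount_v2(&count);

    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        if (nvmlDeviceGetHandleByIndex_v2(i, &dev) != NVML_SUCCESS)
            continue;

        char name[NVML_DEVICE_NAME_BUFFER_SIZE] = {0};
        nvmlDeviceGetName(dev, name, sizeof(name));

        nvmlGpmSupport_t support = {};
        support.version = NVML_GPM_SUPPORT_VERSION;   // must be set before the call
        rc = nvmlGpmQueryDeviceSupport(dev, &support);

        std::printf("GPU %u (%s): rc=%s isSupportedDevice=%u\n",
                    i, name, nvmlErrorString(rc), support.isSupportedDevice);
    }

    nvmlShutdown();
    return 0;
}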

From the NVML API ref, Ampere is not listed as a fully supported device architecture.

The nvidia-smi man page also states: “If any of the metric is not supported on the device or any other error in fetching the metric is reported as “-” in the output data.”

Hi @gkennickell ,

I see the same problem as you, but I read the NVML API documentation differently. Shouldn’t the A100 be supported?

For a full list of supported Linux OS distributions, refer to the Tesla driver release notes at NVIDIA Data Center GPU Driver Documentation.
Supported products

  • Full Support
    • NVIDIA Tesla Line:
      • A100, A40, A30, A16, A10
      • T4

I see no change whether I access the GPM metrics as a regular user or as “root”. Is there some kernel module parameter that needs to be enabled to allow access to the GPM metrics?

Hi @nsmeds – I was using the documentation in the header file regarding supported devices:

It wasn’t clear whether the A100 was technically considered a Tesla product, since the ‘Tesla’ moniker was dropped around the time of its initial release and it was instead branded as a Data Center GPU. But the API reference doc you linked clears that up.

I haven’t had the opportunity to test whether ‘isSupportedDevice’ returns true for an older card like the V100; that might help determine whether it’s truly a support-level issue or whether, as you mention, a kernel parameter needs to be set to enable access.

This usage, also from the nvidia-smi man page, appears to work, though. Even the PCI metrics appear :-D The next step would be to see if the values can be verified. It is not clear what is shown, instantaneous values or a mean over the sample interval, but at least it is something that can probably be used for comparisons between runs.

And there is no indication of FP usage, Tensor Core utilization, or the other metrics that are also of interest.

[nsmeds@ice0102 sample1]$ nvidia-smi dmon -s pucvmet -o DT
#Date       Time        gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk pviol tviol    fb  bar1 sbecc dbecc   pci rxpci txpci
#YYYYMMDD   HH:MM:SS    Idx     W     C     C     %     %     %     %   MHz   MHz     %  bool    MB    MB  errs  errs  errs  MB/s  MB/s
 20230608   18:54:55      0    195     50     51     92     56      0      0   1215   1410      0      0  38609      4      0      0      0     21    318 
 20230608   18:54:55      1    224     51     54     59     44      0      0   1215   1410      0      0  38609      4      0      0      0   2264      0 
 20230608   18:54:55      2     86     48     50     74     54      0      0   1215   1410      0      0  38609      4      0      0      0     21      7 
 20230608   18:54:55      3    200     50     53     74     55      0      0   1215   1395      0      0  38609      4      0      0      0     20   2811 
 20230608   18:54:55      4    226     55     59     62     44      0      0   1215   1410      0      0  38609      4      0      0      0   3684      3 
 20230608   18:54:55      5    237     55     56     74     58      0      0   1215   1410      0      0  38609      4      0      0      0     31      7 
 20230608   18:54:55      6     68     55     56     75     54      0      0   1215   1395      0      0  38609      4      0      0      0   2630      2 
 20230608   18:54:55      7     92     56     57     68     47      0      0   1215   1410      0      0  38609      4      0      0      0   1453      5 
 20230608   18:54:57      0    182     49     50     92     59      0      0   1215   1410      0      0  38609      4      0      0      0     29      7 
 20230608   18:54:57      1    234     51     54     74     54      0      0   1215   1410      0      0  38609      4      0      0      0   2626      3 
 20230608   18:54:57      2    110     49     52     73     58      0      0   1215   1410      0      0  38609      4      0      0      0   1756      5 
 20230608   18:54:57      3    188     50     53     75     59      0      0   1215   1410      0      0  38609      4      0      0      0     28      7 
 20230608   18:54:57      4    215     56     58     74     54      0      0   1215   1410      0      0  38609      4      0      0      0   2273    354 
 20230608   18:54:57      5    239     56     59     59     44      0      0   1215   1410      0      0  38609      4      0      0      0   2850      6 
 20230608   18:54:57      6     68     54     55     61     44      0      0   1215   1410      0      0  38609      4      0      0      0     30      6 
 20230608   18:54:57      7    219     57     59     76     53      0      0   1215   1395      0      0  38609      4      0      0      0   2707    892 

GPM metrics are not supported on pre-H100 SKUs.

Thanks,

So the return code from the query is correct. It would be good if nvidia-smi printed a warning that some of the requested metrics are not available and then output “-” consistently for these (not only after the first update).
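For completeness, my reading of the GPM section in nvml.h is that the flow is a two-sample difference, and the support query is exactly where such a guard would sit. A rough sketch (assuming the metric IDs are the same numbers that --gpm-metrics takes):

// Rough sketch of the GPM flow from nvml.h: check support, take two samples,
// then let NVML compute metrics over the interval.  Linux-only sleep() used
// for the sample interval.
#include <cstdio>
#include <unistd.h>
#include <nvml.h>

int main()
{
    if (nvmlInit_v2() != NVML_SUCCESS) return 1;

    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex_v2(0, &dev) != NVML_SUCCESS) return 1;

    nvmlGpmSupport_t support = {};
    support.version = NVML_GPM_SUPPORT_VERSION;
    if (nvmlGpmQueryDeviceSupport(dev, &support) != NVML_SUCCESS ||
        support.isSupportedDevice == 0) {
        // This is where a tool could warn once and print "-" for every update.
        std::printf("GPM not supported on this device\n");
        nvmlShutdown();
        return 0;
    }

    nvmlGpmSample_t s1, s2;
    nvmlGpmSampleAlloc(&s1);
    nvmlGpmSampleAlloc(&s2);

    nvmlGpmSampleGet(dev, s1);
    sleep(1);                                  // sample interval
    nvmlGpmSampleGet(dev, s2);

    nvmlGpmMetricsGet_t mg = {};
    mg.version    = NVML_GPM_METRICS_GET_VERSION;
    mg.numMetrics = 2;
    mg.sample1    = s1;
    mg.sample2    = s2;
    mg.metrics[0].metricId = NVML_GPM_METRIC_SM_UTIL;       // assumed to be --gpm-metrics ID 2
    mg.metrics[1].metricId = NVML_GPM_METRIC_SM_OCCUPANCY;  // assumed to be --gpm-metrics ID 3

    if (nvmlGpmMetricsGet(&mg) == NVML_SUCCESS) {
        for (unsigned int i = 0; i < mg.numMetrics; ++i)
            std::printf("metric %u: value=%f rc=%d\n",
                        mg.metrics[i].metricId,
                        mg.metrics[i].value,
                        (int)mg.metrics[i].nvmlReturn);
    }

    nvmlGpmSampleFree(s1);
    nvmlGpmSampleFree(s2);
    nvmlShutdown();
    return 0;
}

On the Ampere cards discussed above this takes the “not supported” branch, consistent with the isSupportedDevice=0 result reported earlier.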

/Nils
