Nvidia-smi: gpm-metrics not populated

When running ‘nvidia-smi dmon --gpm-metrics=2,3,4,5’ (other metric selections show the same issue),
each GPM metric is reported as ‘0’ for the first update and ‘-’ thereafter.

System:
Ubuntu 20.04.5 LTS (GNU/Linux 5.15.0-1013-oracle x86_64)
NVIDIA A100-SXM4-80GB
NVIDIA-SMI 525.60.13

and same result on Windows:
Windows 10 Pro 64-bit
NVIDIA GeForce RTX 3090
NVIDIA-SMI 526.47

Is a specific version of nvml / debug mode / system configuration required for accessing these stats?

Since nvidia-smi wraps NVML, I wrote a quick C++ NVML program to test GPM-metrics queries. ‘nvmlGpmQueryDeviceSupport’ returns isSupportedDevice=0 for both cards listed above (both Ampere).
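For reference, the check looks roughly like this (a minimal sketch rather than the exact program; it just loops over the visible GPUs and prints what nvmlGpmQueryDeviceSupport reports):

// Minimal sketch: query GPM support for every visible GPU and print the result.
// Build with e.g.  g++ gpm_check.cpp -o gpm_check -lnvidia-ml
#include <cstdio>
#include <nvml.h>

int main()
{
    nvmlReturn_t rc = nvmlInit_v2();
    if (rc != NVML_SUCCESS) {
        std::fprintf(stderr, "nvmlInit_v2 failed: %s\n", nvmlErrorString(rc));
        return 1;
    }

    unsigned int count = 0;
    nvmlDeviceGetCount_v2(&count);

    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        if (nvmlDeviceGetHandleByIndex_v2(i, &dev) != NVML_SUCCESS)
            continue;

        char name[NVML_DEVICE_NAME_BUFFER_SIZE] = {0};
        nvmlDeviceGetName(dev, name, sizeof(name));

        nvmlGpmSupport_t support = {};
        support.version = NVML_GPM_SUPPORT_VERSION;   // must be set before the call
        rc = nvmlGpmQueryDeviceSupport(dev, &support);

        std::printf("GPU %u (%s): rc=%s isSupportedDevice=%u\n",
                    i, name, nvmlErrorString(rc), support.isSupportedDevice);
    }

    nvmlShutdown();
    return 0;
}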

From the NVML API ref, Ampere is not listed as a fully supported device architecture.

The nvidia-smi man page also states: “If any of the metric is not supported on the device or any other error in fetching the metric is reported as “-” in the output data.”

Hi @gkennickell ,

I see the same problem as you, but I read the NVML API documentation differently. Shouldn’t the A100 be supported?

For a full list of supported Linux OS distributions, refer to the Tesla driver release notes at NVIDIA Data Center GPU Driver Documentation.
Supported products

  • Full Support
    • NVIDIA Tesla Line:
      • A100, A40, A30, A16, A10
      • T4

I see no change whether I access the GPM metrics as a regular user or as “root”. Is there some kernel module parameter that needs to be enabled to allow access to the GPM metrics?

Hi @nsmeds – I was using the documentation in the header file regarding supported devices:

It wasn’t clear whether the A100 was technically considered a Tesla product, since the ‘Tesla’ moniker was dropped around the time of its initial release and it was instead branded as a Data Center GPU. But the API reference doc you linked clears that up.

I haven’t had the opportunity to test whether ‘isSupportedDevice’ returns true for an older card like the V100; that might help determine whether it’s truly a support-level issue or whether, as you mention, a kernel parameter needs to be set to enable access.

This usage, also from the nvidia-smi man page, appears to work, though. Even the PCI metrics appear :-D The next step would be to see if the values can be verified. It is not clear what is shown, instantaneous values or a mean over the sample interval, but at least it is something that can probably be used for comparisons between runs.

And there is no indication of FP usage, Tensor Core utilization, or the other metrics that are also of interest.

[nsmeds@ice0102 sample1]$ nvidia-smi dmon -s pucvmet -o DT
#Date       Time        gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk pviol tviol    fb  bar1 sbecc dbecc   pci rxpci txpci
#YYYYMMDD   HH:MM:SS    Idx     W     C     C     %     %     %     %   MHz   MHz     %  bool    MB    MB  errs  errs  errs  MB/s  MB/s
 20230608   18:54:55      0    195     50     51     92     56      0      0   1215   1410      0      0  38609      4      0      0      0     21    318 
 20230608   18:54:55      1    224     51     54     59     44      0      0   1215   1410      0      0  38609      4      0      0      0   2264      0 
 20230608   18:54:55      2     86     48     50     74     54      0      0   1215   1410      0      0  38609      4      0      0      0     21      7 
 20230608   18:54:55      3    200     50     53     74     55      0      0   1215   1395      0      0  38609      4      0      0      0     20   2811 
 20230608   18:54:55      4    226     55     59     62     44      0      0   1215   1410      0      0  38609      4      0      0      0   3684      3 
 20230608   18:54:55      5    237     55     56     74     58      0      0   1215   1410      0      0  38609      4      0      0      0     31      7 
 20230608   18:54:55      6     68     55     56     75     54      0      0   1215   1395      0      0  38609      4      0      0      0   2630      2 
 20230608   18:54:55      7     92     56     57     68     47      0      0   1215   1410      0      0  38609      4      0      0      0   1453      5 
 20230608   18:54:57      0    182     49     50     92     59      0      0   1215   1410      0      0  38609      4      0      0      0     29      7 
 20230608   18:54:57      1    234     51     54     74     54      0      0   1215   1410      0      0  38609      4      0      0      0   2626      3 
 20230608   18:54:57      2    110     49     52     73     58      0      0   1215   1410      0      0  38609      4      0      0      0   1756      5 
 20230608   18:54:57      3    188     50     53     75     59      0      0   1215   1410      0      0  38609      4      0      0      0     28      7 
 20230608   18:54:57      4    215     56     58     74     54      0      0   1215   1410      0      0  38609      4      0      0      0   2273    354 
 20230608   18:54:57      5    239     56     59     59     44      0      0   1215   1410      0      0  38609      4      0      0      0   2850      6 
 20230608   18:54:57      6     68     54     55     61     44      0      0   1215   1410      0      0  38609      4      0      0      0     30      6 
 20230608   18:54:57      7    219     57     59     76     53      0      0   1215   1395      0      0  38609      4      0      0      0   2707    892 

GPM metrics are not supported on pre-H100 SKUs.

Thanks,

So the return code from the query is correct. It would be good if nvidia-smi printed a warning that some of the requested metrics are not available and then output “-” consistently for these (not only after the first update).
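For completeness, my reading of the GPM section in nvml.h is that the flow is a two-sample difference, and the support query is exactly where such a guard would sit. A rough sketch (assuming the metric IDs are the same numbers that --gpm-metrics takes):

// Rough sketch of the GPM flow from nvml.h: check support, take two samples,
// then let NVML compute metrics over the interval.  Linux-only sleep() used
// for the sample interval.
#include <cstdio>
#include <unistd.h>
#include <nvml.h>

int main()
{
    if (nvmlInit_v2() != NVML_SUCCESS) return 1;

    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex_v2(0, &dev) != NVML_SUCCESS) return 1;

    nvmlGpmSupport_t support = {};
    support.version = NVML_GPM_SUPPORT_VERSION;
    if (nvmlGpmQueryDeviceSupport(dev, &support) != NVML_SUCCESS ||
        support.isSupportedDevice == 0) {
        // This is where a tool could warn once and print "-" for every update.
        std::printf("GPM not supported on this device\n");
        nvmlShutdown();
        return 0;
    }

    nvmlGpmSample_t s1, s2;
    nvmlGpmSampleAlloc(&s1);
    nvmlGpmSampleAlloc(&s2);

    nvmlGpmSampleGet(dev, s1);
    sleep(1);                                  // sample interval
    nvmlGpmSampleGet(dev, s2);

    nvmlGpmMetricsGet_t mg = {};
    mg.version    = NVML_GPM_METRICS_GET_VERSION;
    mg.numMetrics = 2;
    mg.sample1    = s1;
    mg.sample2    = s2;
    mg.metrics[0].metricId = NVML_GPM_METRIC_SM_UTIL;       // assumed to be --gpm-metrics ID 2
    mg.metrics[1].metricId = NVML_GPM_METRIC_SM_OCCUPANCY;  // assumed to be --gpm-metrics ID 3

    if (nvmlGpmMetricsGet(&mg) == NVML_SUCCESS) {
        for (unsigned int i = 0; i < mg.numMetrics; ++i)
            std::printf("metric %u: value=%f rc=%d\n",
                        mg.metrics[i].metricId,
                        mg.metrics[i].value,
                        (int)mg.metrics[i].nvmlReturn);
    }

    nvmlGpmSampleFree(s1);
    nvmlGpmSampleFree(s2);
    nvmlShutdown();
    return 0;
}

On the Ampere cards discussed above this takes the “not supported” branch, consistent with the isSupportedDevice=0 result reported earlier.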

/Nils
