When running ‘nvidia-smi dmon --gpm-metrics=2,3,4,5’ (other metric choices show the same issue),
each GPM metric is reported as ‘0’ for the first update and ‘-’ thereafter.
Since nvidia-smi wraps NVML, I wrote a quick C++ NVML program to test the GPM metric queries. ‘nvmlGpmQueryDeviceSupport’ returns isSupportedDevice=0 for both cards listed above (both Ampere).
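For reference, a minimal sketch of that kind of probe (not my exact program; it assumes nvml.h from a recent driver/CUDA package, links against -lnvidia-ml, and keeps error handling to a minimum):

```cpp
// Probe GPM support per device via NVML.
// Build (paths may differ): g++ gpm_probe.cpp -o gpm_probe -lnvidia-ml
#include <cstdio>
#include <nvml.h>

int main() {
    nvmlReturn_t rc = nvmlInit_v2();
    if (rc != NVML_SUCCESS) {
        std::fprintf(stderr, "nvmlInit_v2 failed: %s\n", nvmlErrorString(rc));
        return 1;
    }

    unsigned int count = 0;
    nvmlDeviceGetCount_v2(&count);

    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        if (nvmlDeviceGetHandleByIndex_v2(i, &dev) != NVML_SUCCESS)
            continue;

        char name[NVML_DEVICE_NAME_BUFFER_SIZE] = {0};
        nvmlDeviceGetName(dev, name, sizeof(name));

        // Ask NVML whether this device supports GPM metrics at all.
        nvmlGpmSupport_t support = {};
        support.version = NVML_GPM_SUPPORT_VERSION;
        rc = nvmlGpmQueryDeviceSupport(dev, &support);

        if (rc == NVML_SUCCESS)
            std::printf("GPU %u (%s): isSupportedDevice=%u\n",
                        i, name, support.isSupportedDevice);
        else
            std::printf("GPU %u (%s): nvmlGpmQueryDeviceSupport -> %s\n",
                        i, name, nvmlErrorString(rc));
    }

    nvmlShutdown();
    return 0;
}
```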
From the NVML API ref, Ampere is not listed as a fully supported device architecture.
The nvidia-smi man page also states: “If any of the metric is not supported on the device or any other error in fetching the metric is reported as “-” in the output data.”
I see no change whether I access the GPM metrics as a regular user or as “root” - is there some kernel module parameter that needs to be enabled to allow access to the GPM metrics?
Hi @nsmeds – I was using the documentation in the header file regarding supported devices:
It wasn’t clear whether the A100 was technically considered a Tesla, since the ‘Tesla’ moniker was dropped at the time of its initial release and it was instead branded as a Data Center GPU. But the API reference doc you linked clears that up.
I haven’t had the opportunity to test whether ‘isSupportedDevice’ returns true for an older card like the V100; that might help determine whether it’s truly a support-level issue or if, as you mention, a kernel parameter needs to be set in order to use them.
This usage, also from the nvidia-smi man page, appears to work, though. Even the PCI metrics appear :-D The next step would be to see whether they can be verified. It is not clear what is shown: instantaneous values or a mean over the sample interval. But at least it is something that can probably be used for comparisons between runs.
And there is no indication of FP usage, tensor core utilization, or other metrics that are also of interest.
So the return code from the query is correct. It would be good if nvidia-smi printed a warning that some of the requested metrics are not available and then output “-” consistently for those (not only after the first iteration).