When I check the available GPU metrics devices on an NVIDIA A10 GPU, the device is reported as “not supported”:
root [ /my-files ]# /opt/nvidia/nsight-systems/2023.1.1/bin/nsys profile --gpu-metrics-device=help
Possible --gpu-metrics-device values are:
0: NVIDIA A10-24Q PCI[7dca:00:00.0] (not supported)
Is there a plan to support GPU metrics on the A10 GPU?
Or is the A10 supported, and my setup is incorrect for some reason?
My setup works with A100 and T4.
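To check this across several VMs, the help output above can be parsed by a script instead of read by eye. Below is a minimal, hypothetical Python helper. The function names are illustrative, and the line format it matches is inferred from the single `nsys` 2023.1.1 output shown here, so other Nsight Systems versions may print it differently.

```python
import re
import subprocess

# Matches lines such as:
#   0: NVIDIA A10-24Q PCI[7dca:00:00.0] (not supported)
# Assumption: the format is inferred from one nsys 2023.1.1 run and may differ elsewhere.
DEVICE_LINE = re.compile(
    r"^\s*(?P<index>\d+):\s+(?P<name>.+?)\s+PCI\[(?P<pci>[^\]]+)\](?P<rest>.*)$"
)


def parse_gpu_metrics_devices(output: str) -> list[dict]:
    """Parse `nsys profile --gpu-metrics-device=help` output into device records."""
    devices = []
    for line in output.splitlines():
        m = DEVICE_LINE.match(line)
        if m is None:
            continue  # header line or anything else we do not recognize
        devices.append({
            "index": int(m.group("index")),
            "name": m.group("name"),
            "pci": m.group("pci"),
            # Any trailing "(not supported)" marks the device as unusable for GPU metrics.
            "supported": "not supported" not in m.group("rest"),
        })
    return devices


def list_gpu_metrics_devices(nsys: str = "nsys") -> list[dict]:
    """Run nsys (hypothetical wrapper; nsys must be installed) and parse its device list."""
    result = subprocess.run(
        [nsys, "profile", "--gpu-metrics-device=help"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Not sure which stream the help text goes to, so scan both.
    return parse_gpu_metrics_devices(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    sample = (
        "Possible --gpu-metrics-device values are:\n"
        "0: NVIDIA A10-24Q PCI[7dca:00:00.0] (not supported)\n"
    )
    for dev in parse_gpu_metrics_devices(sample):
        print(dev)
```

For example, `list_gpu_metrics_devices("/opt/nvidia/nsight-systems/2023.1.1/bin/nsys")` would run the same command as in the post and return one record per GPU with a `supported` flag.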
I am surprised that this isn’t working for you.
What driver are you using?
@Andrey_Trachenko, who set up the supported metrics?
I think it’s an issue with Azure’s internal setup. Even nvidia-smi and nvidia-smi pmon do not report GPU information such as utilization:
nvidia-smi
Mon Mar 20 18:49:02 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.08    Driver Version: 510.73.08    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10-24Q      On   | 00008F62:00:00.0 Off |                    0 |
| N/A   N/A    P8    N/A /  N/A |    127MiB / 24512MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1367      G   /usr/lib/xorg/Xorg                103MiB |
|    0   N/A  N/A      1919      G   /usr/bin/gnome-shell               22MiB |
+-----------------------------------------------------------------------------+
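The same N/A symptom can be checked without reading the table by hand. Below is a minimal, hypothetical Python helper that calls `nvidia-smi --query-gpu=index,name,utilization.gpu --format=csv,noheader,nounits` and treats bracketed values such as `[N/A]` or `[Not Supported]` as unavailable. That placeholder convention is an assumption about nvidia-smi's CSV output, not something confirmed in this thread.

```python
import shutil
import subprocess

# Fields passed to --query-gpu; their order is the column order of the CSV output.
QUERY_FIELDS = ["index", "name", "utilization.gpu"]


def parse_query_csv(text: str) -> list[dict]:
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU.

    Assumption: values wrapped in brackets (e.g. "[N/A]", "[Not Supported]") are
    nvidia-smi placeholders for data the driver cannot report; they become None.
    """
    rows = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        row = {}
        for field, value in zip(QUERY_FIELDS, values):
            if value.startswith("["):
                row[field] = None
            elif field in ("index", "utilization.gpu"):
                row[field] = int(value)  # nounits leaves a bare number, e.g. "37"
            else:
                row[field] = value
        rows.append(row)
    return rows


def query_gpus() -> list[dict]:
    """Run nvidia-smi and return the parsed rows (raises if the command fails)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_query_csv(result.stdout)


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found on PATH")
    else:
        for gpu in query_gpus():
            util = gpu["utilization.gpu"]
            status = "unavailable" if util is None else f"{util}%"
            print(f"GPU {gpu['index']} ({gpu['name']}): utilization {status}")
```

On the A10-24Q VM above, one would expect `utilization.gpu` to come back as `None`, matching the N/A columns, while the T4 and A100 VMs should report a number.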
I can run Nsight Systems and collect profiles successfully, so the issue appears to be specific to GPU metrics collection.
I had other priorities and have not followed up with Azure support yet.
Thank you
As I understand it, Azure VMs may run on top of Hyper-V, which is currently not a supported configuration for collecting GPU metrics. I believe the issue relates to the permissions model: device-wide collection is not possible on a multi-tenant host.
Maybe. However, I can successfully get GPU metrics and nvidia-smi pmon output from Azure VMs running T4s and A100s. For some reason, only the A10s do not support it. :(
Thanks
I’m not sure what software stack runs on the Azure VMs. You could ask Azure support whether GPU hardware performance monitor (HWPM) profiling is enabled on the A10 systems you use. Sorry I can’t be more helpful at the moment.