Hi,
I am interested in collecting the following two GPU metrics on the GH200 using Nsys CLI:
- NVLink bytes received - nvlrx__bytes.avg.pct_of_peak_sustained_elapsed
- NVLink bytes transmitted - nvltx__bytes.avg.pct_of_peak_sustained_elapsed
However, I have two issues:
- gpu-metrics-set has no alias for GH200 as of CUDA 12.4. In addition, selecting the gh100 set fails with the error "None of the installed GPUs are supported" (system details and full output follow):
$ nvidia-smi
Wed Aug 7 13:10:52 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GH200 120GB             On  |   00000009:01:00.0 Off |                    0 |
| N/A   30C    P0             89W /  900W |      20MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
$ nsys profile --gpu-metrics-set=help
Possible --gpu-metrics-set values are:
tu10x : General Metrics for NVIDIA TU10x (any frequency)
tu11x : General Metrics for NVIDIA TU11x (any frequency)
ga100 : General Metrics for NVIDIA GA100 (any frequency)
ga10x : General Metrics for NVIDIA GA10x (any frequency)
gh100 : General Metrics for NVIDIA GH100 (any frequency)
ad10x : General Metrics for NVIDIA AD10x (any frequency)
ga10b-gfxt : Graphics Throughput Metrics for NVIDIA GA10B (frequency >= 10kHz)
tu10x-gfxt : Graphics Throughput Metrics for NVIDIA TU10x (frequency >= 10kHz)
ga10x-gfxt : Graphics Throughput Metrics for NVIDIA GA10x (frequency >= 10kHz)
ad10x-gfxt : Graphics Throughput Metrics for NVIDIA AD10x (frequency >= 10kHz)
ga10x-gfxact : Graphics Async Compute Triage Metrics for NVIDIA GA10x (frequency >= 10kHz)
$ nsys profile --gpu-metrics-set=gh100 --gpu-metrics-device=all
Illegal --gpu-metrics-device arguments.
None of the installed GPUs are supported
- Could you please point me to the list of metrics/perf counters that do not require root access in Nsys? I could not find this in the documentation. For example, the following command fails with a permissions error:
$ nsys profile --cpu-core-metrics=0 --gpu-metrics-device=0 --cuda-um-cpu-page-faults=true --cuda-um-gpu-page-faults=true ls
The user running Nsight Systems does not have permission to access NVIDIA GPU Performance Counters on the target device. For more details, please visit https://developer.nvidia.com/ERR_NVGPUCTRPERM.