Gpu-metrics-set not found for GH200

Hi,

I am interested in collecting the following two GPU metrics on the GH200 using Nsys CLI:

  • NVLink bytes received - nvlrx__bytes.avg.pct_of_peak_sustained_elapsed
  • NVLink bytes transmitted - nvltx__bytes.avg.pct_of_peak_sustained_elapsed

However, I have two issues:

  1. --gpu-metrics-set does not have an alias for GH200 as of CUDA 12.4. In addition, when I select gh100 instead, I get the error "None of the installed GPUs are supported".
$ nvidia-smi
Wed Aug  7 13:10:52 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GH200 120GB             On  |   00000009:01:00.0 Off |                    0 |
| N/A   30C    P0             89W /  900W |      20MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

$ nsys profile --gpu-metrics-set=help
Possible --gpu-metrics-set values are:
        tu10x        : General Metrics for NVIDIA TU10x (any frequency)
        tu11x        : General Metrics for NVIDIA TU11x (any frequency)
        ga100        : General Metrics for NVIDIA GA100 (any frequency)
        ga10x        : General Metrics for NVIDIA GA10x (any frequency)
        gh100        : General Metrics for NVIDIA GH100 (any frequency)
        ad10x        : General Metrics for NVIDIA AD10x (any frequency)
        ga10b-gfxt   : Graphics Throughput Metrics for NVIDIA GA10B (frequency >= 10kHz)
        tu10x-gfxt   : Graphics Throughput Metrics for NVIDIA TU10x (frequency >= 10kHz)
        ga10x-gfxt   : Graphics Throughput Metrics for NVIDIA GA10x (frequency >= 10kHz)
        ad10x-gfxt   : Graphics Throughput Metrics for NVIDIA AD10x (frequency >= 10kHz)
        ga10x-gfxact : Graphics Async Compute Triage Metrics for NVIDIA GA10x (frequency >= 10kHz)

$ nsys profile --gpu-metrics-set=gh100 --gpu-metrics-device=all
Illegal --gpu-metrics-device arguments.
None of the installed GPUs are supported
  2. Could you please point me to the list of metrics/performance counters that do not require root access in Nsys? I could not find this in the documentation.
$ nsys profile --cpu-core-metrics=0 --gpu-metrics-device=0 --cuda-um-cpu-page-faults=true --cuda-um-gpu-page-faults=true ls
The user running Nsight Systems does not have permission to access NVIDIA GPU Performance Counters on the target device. For more details, please visit https://developer.nvidia.com/ERR_NVGPUCTRPERM.

All hardware performance counters require root/admin access. If you stop and think about what it means to have software able to access the hardware at that level you will understand that that would be a huge security hole in any kind of shared system.

May I ask what version of Nsight Systems you are using?

My nsys version is 2024.1.1.59

I disagree with the blanket statement that any hardware profiling requires sudo. It is impractical to give privileged access to all users of a research cluster. Plenty of tools measure hardware statistics without privileged access, such as perf and even NVIDIA's own ncu. For example: /ncu --target-processes all -o fu_report --csv --metrics smsp__pipe_fma_cycles_active.avg.pct_of_peak_sustained_activ. Sure, there can be restrictions on which counters are accessible, but please verify your sources for the claim that "all hardware counters require root".

My perf_event_paranoid is set to 0.

perf_event_paranoid:

Controls use of the performance events system by unprivileged
users (without CAP_SYS_ADMIN).  The default value is 2.

 -1: Allow use of (almost) all events by all users
     Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
>=0: Disallow ftrace function tracepoint by users without CAP_SYS_ADMIN
     Disallow raw tracepoint access by users without CAP_SYS_ADMIN
>=1: Disallow CPU event access by users without CAP_SYS_ADMIN
>=2: Disallow kernel profiling by users without CAP_SYS_ADMIN
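
To make the levels quoted above concrete, here is a minimal sketch that reads the current value and names the strictest restriction in effect. The function name and the output labels are made up for illustration. Note that this setting governs the Linux perf events system; judging by the error message earlier in the thread, the NVIDIA GPU counter permission is checked separately.

```shell
#!/bin/sh
# Hypothetical helper: summarize perf_event_paranoid using the levels
# quoted above. Restrictions are cumulative, so each label names the
# strictest one that applies at that level.
perf_paranoid_summary() {
    level_file="${1:-/proc/sys/kernel/perf_event_paranoid}"
    if [ ! -r "$level_file" ]; then
        echo "unknown"
        return 0
    fi
    # Strip the trailing newline and any stray whitespace.
    level=$(tr -d '[:space:]' < "$level_file")
    case "$level" in
        -1) echo "unrestricted" ;;
        0) echo "no-raw-tracepoints" ;;
        1) echo "no-cpu-events" ;;
        [2-9]|[1-9][0-9]) echo "no-kernel-profiling" ;;
        *) echo "unknown" ;;
    esac
}

perf_paranoid_summary "$@"
```

Run without arguments, it inspects the live system; passing a file path lets you try it against a saved value.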

Alright, I was being a little general in what I meant by admin. Realistically, you don't let anyone set perf_event_paranoid=0 or grant CAP_SYS_ADMIN unless you are pretty comfortable with them getting information on all jobs on the system, not just their own.

Our GPU counters are coming direct from the hardware and require sudo.

The user running Nsight Systems does not have permission to access NVIDIA GPU Performance Counters on the target device. For more details, please visit https://developer.nvidia.com/ERR_NVGPUCTRPERM.

The user must run with sudo, or root must grant all users permission to collect GPU performance counters. Please read the information above. If this is problematic in your environment, then please file a feature request.

The mods were not helpful here, and the information provided was inaccurate and not useful. I was able to get a limited set of counters to work by following the NVIDIA article and loading the driver module with profiling unrestricted:

modprobe nvidia NVreg_RestrictProfilingToAdminUsers=0
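
To check whether the setting took effect, something like the sketch below may help. It assumes the driver exposes its parameters in /proc/driver/nvidia/params with a RmProfilingAdminOnly entry (1 = admin only, 0 = all users), as described in the NVIDIA ERR_NVGPUCTRPERM article; the function name and output labels are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: report whether GPU performance counters are
# restricted to admin users, based on the driver's params file.
# Assumption: the file contains a line like "RmProfilingAdminOnly: 1".
profiling_restricted() {
    params_file="${1:-/proc/driver/nvidia/params}"
    if [ ! -r "$params_file" ]; then
        echo "unknown"
        return 0
    fi
    # Split each line on ": " and print the value for the matching key.
    value=$(awk -F': *' '$1 == "RmProfilingAdminOnly" {print $2}' "$params_file")
    case "$value" in
        0) echo "unrestricted" ;;
        1) echo "admin-only" ;;
        *) echo "unknown" ;;
    esac
}

profiling_restricted "$@"
```

If it prints "admin-only" after the modprobe, the module was probably already loaded and needs to be unloaded first. To keep the setting across reboots, the same article suggests placing the line "options nvidia NVreg_RestrictProfilingToAdminUsers=0" in a file under /etc/modprobe.d/; both steps require root.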
