Can't get GPU Metrics with Nsight Systems

When I run the program:

python3 main_tcgnn.py --dataset citeseer --dim 3703 --hidden 16 --classes 6 --num_layers 2 --model gcn

it works flawlessly (screenshot of the output omitted).

However, when I use Nsight Systems to profile the program with a command like:

sudo /usr/local/cuda/nsight-system/bin/nsys profile --stats=true --gpu-metrics-device=0 --gpu-metrics-frequency=10000 python3 main_tcgnn.py --dataset citeseer --dim 3703 --hidden 16 --classes 6 --num_layers 2 --model gcn

the program still runs fine, but the GPU Metrics are missing from the generated report (screenshot omitted),

and the report is generated with some errors (screenshot of the diagnostics omitted).

Environment:
NVIDIA Nsight Systems version 2022.5.1.82-32078057v0
RTX 2080 Ti(11GB) * 1

So how can I solve this problem and get the correct GPU metrics, such as Tensor Core activity?

What driver do you have?

Can you send me the results from running “nsys status -e” at the command line?

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.76       Driver Version: 515.76       CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:40:00.0 Off |                  N/A |
| 31%   29C    P8    34W / 250W |      0MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root@autodl-container-7b5011b452-56963392:~/ly-zjlab/TCGNN# /usr/local/cuda/nsight-system/bin/nsys status -e
Timestamp counter supported: Yes

CPU Profiling Environment Check
Root privilege: enabled
Linux Kernel Paranoid Level = 3
Linux Distribution = Ubuntu
Linux Kernel Version = 5.4.0-126-generic: OK
Linux perf_event_open syscall available: Fail
Sampling trigger event available: Fail
Intel(c) Last Branch Record support: Not Available
CPU Profiling Environment (process-tree): Fail
CPU Profiling Environment (system-wide): Fail

@pkovalenko, can you please chime in on this?

It looks like this is being run from a Docker container, right? The page referenced in the diagnostic message (https://developer.nvidia.com/ERR_NVGPUCTRPERM) describes the right steps to fix the issue. Specifically, --cap-add=SYS_ADMIN has to be added to the docker run arguments.
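For reference, a minimal sketch of how the capability could be added when starting the container (the image name, GPU flag, and shell below are placeholders, not taken from the original setup):

# hypothetical launch: expose the GPUs and grant the SYS_ADMIN capability
# so Nsight Systems can read GPU performance counters inside the container
docker run --gpus all --cap-add=SYS_ADMIN -it <your-image> /bin/bash

After restarting the container this way, the same nsys profile command should be able to collect the GPU Metrics rows.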


Hi, did you solve this problem? I met the same problem when using an AutoDL GPU machine.

Maybe there is some problem with AutoDL; after we switched to an A100, the problem did not appear again.

It works when I run “docker run xxx --cap-add=SYS_ADMIN”.