I am having issues running ncu from the command line. I executed
ncu -o profile python test.py
from a Linux terminal (test.py calls cuDNN/CUDA kernels), and it produced the following output:
==PROF== Connected to process 839907 (/data/users/dzdang/miniconda3/envs/pytorch/bin/python3.9)
==ERROR== Profiling failed because a driver resource was unavailable. Ensure that no other tool (like DCGM) is concurrently collecting profiling data. See https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#faq for more details.
==ERROR== Failed to profile kernel "distribution_elementwise_grid..." in process 839907
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.
I read the FAQ page linked in the output and tried pausing DCGM profiling with
dcgmi profile --pause
but this did not resolve the issue.
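To rule out other tools holding the performance counters, a process listing can be filtered for likely candidates. The following is only a minimal sketch: the helper name filter_profiler_holders is hypothetical, and the list of process names (nv-hostengine for the DCGM host engine, dcgm-exporter, and other profilers) is an assumption, not an official or exhaustive list.

```shell
#!/usr/bin/env bash
# Sketch: find processes that might be holding the GPU profiling resources.
# ASSUMPTION: this name list is illustrative, not exhaustive or official.
CANDIDATES='nv-hostengine|dcgm-exporter|dcgmi|ncu|nsys|nvprof'

# Reads a "PID COMMAND" listing (as printed by `ps -eo pid,comm`) on stdin
# and prints only the rows whose command name matches one of the candidates.
# The header row is skipped because its first field is not numeric.
filter_profiler_holders() {
    awk -v pat="^(${CANDIDATES})\$" \
        '$1 ~ /^[0-9]+$/ && $2 ~ pat { print $1, $2 }'
}

# Usage on a live system:
#   ps -eo pid,comm | filter_profiler_holders
```

If this prints nothing, no process with one of those names is running on the host, although a tool running inside a container or under a different name would not be caught by this check.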
The FAQ also states that the issue could be due to "another instance of NVIDIA Nsight Compute without access to the same file system (see serialization for how this is prevented within the same file system)."
I located the nsight-compute-lock file as instructed, but it is empty. Is something supposed to be inside it?
These are the outputs of ncu --version and nvidia-smi, respectively:
NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2021 NVIDIA Corporation
Version 2021.3.0.0 (build 30414874) (public-release)
and
Tue Mar 8 06:41:46 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PG5...  On   | 00000000:11:00.0 Off |                    0 |
| N/A   28C    P0    49W / 330W |      3MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PG5...  On   | 00000000:12:00.0 Off |                    0 |
| N/A   30C    P0    51W / 330W |      3MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-PG5...  On   | 00000000:48:00.0 Off |                    0 |
| N/A   28C    P0    51W / 330W |      3MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-PG5...  On   | 00000000:49:00.0 Off |                    0 |
| N/A   30C    P0    52W / 330W |      3MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-PG5...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   28C    P0    49W / 330W |      3MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-PG5...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   29C    P0    49W / 330W |      3MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-PG5...  On   | 00000000:C6:00.0 Off |                    0 |
| N/A   28C    P0    52W / 330W |      3MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-PG5...  On   | 00000000:C9:00.0 Off |                    0 |
| N/A   29C    P0    51W / 330W |      3MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
WARNING: infoROM is corrupted at gpu 0000:11:00.0
I am not sure whether the warning "WARNING: infoROM is corrupted at gpu 0000:11:00.0" has anything to do with it.
I also read the threads "Which application accesses the driver's performance monitor" (post #4 by felix_dt) and "Question about GPU Operator (DCGM) relation ship?", but neither helped.