Profiling failed because a driver resource was unavailable

I am having issues running ncu from the command line. I executed

ncu -o profile python test.py

from a Linux terminal (test.py calls cuDNN/CUDA kernels), and it produced the following output:

==PROF== Connected to process 839907 (/data/users/dzdang/miniconda3/envs/pytorch/bin/python3.9)

==ERROR== Profiling failed because a driver resource was unavailable. Ensure that no other tool (like DCGM) is concurrently collecting profiling data. See https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#faq for more details.
==ERROR== Failed to profile kernel "distribution_elementwise_grid..." in process 839907
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.
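
(A side note on that last warning: if the kernels had been launched from a subprocess of test.py, ncu would also need the --target-processes option it mentions, i.e. something like

ncu --target-processes all -o profile python test.py

but that does not appear to be the cause of the driver-resource error here, since the kernel that failed was in the process ncu connected to.)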

I read the FAQ page linked in the output and tried

dcgmi profile --pause

but this didn’t resolve the issue.
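
I assume the next step is to check whether the DCGM host engine is running at all and to stop it entirely rather than just pausing it, roughly like this (the systemd service name is a guess on my part and may differ between installs):

ps aux | grep nv-hostengine
systemctl status nvidia-dcgm
sudo systemctl stop nvidia-dcgm

but I'm not even sure DCGM is the culprit in the first place.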

The FAQ also stated that the issue could be due to "another instance of NVIDIA Nsight Compute without access to the same file system (see serialization for how this is prevented within the same file system)".

I located the nsight-compute-lock file as instructed, but it is empty. Is something supposed to be inside it?
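
To rule out a second ncu instance holding that lock, I figure something like the following should work (the /tmp path is just my assumption of the default temp directory):

lsof /tmp/nsight-compute-lock
ps aux | grep -i nsight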

These are the outputs of ncu --version and nvidia-smi, respectively:

NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2021 NVIDIA Corporation
Version 2021.3.0.0 (build 30414874) (public-release)

and

Tue Mar  8 06:41:46 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PG5...  On   | 00000000:11:00.0 Off |                    0 |
| N/A   28C    P0    49W / 330W |      3MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PG5...  On   | 00000000:12:00.0 Off |                    0 |
| N/A   30C    P0    51W / 330W |      3MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-PG5...  On   | 00000000:48:00.0 Off |                    0 |
| N/A   28C    P0    51W / 330W |      3MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-PG5...  On   | 00000000:49:00.0 Off |                    0 |
| N/A   30C    P0    52W / 330W |      3MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-PG5...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   28C    P0    49W / 330W |      3MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-PG5...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   29C    P0    49W / 330W |      3MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-PG5...  On   | 00000000:C6:00.0 Off |                    0 |
| N/A   28C    P0    52W / 330W |      3MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-PG5...  On   | 00000000:C9:00.0 Off |                    0 |
| N/A   29C    P0    51W / 330W |      3MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
WARNING: infoROM is corrupted at gpu 0000:11:00.0

I am not sure whether the "WARNING: infoROM is corrupted at gpu 0000:11:00.0" message has anything to do with this.

I also read the threads "Which application accesses the driver's performance monitor - #4 by felix_dt" and "Question about GPU Operator (DCGM) relation ship?", but neither helped.

I did a sudo reboot and ran the command again, and it worked, but subsequent executions produced the same error. My guess is that the first run succeeded because I ran ncu shortly after the system rebooted, before the conflicting program/process had started up. How can I go about debugging this issue?

(The reboot also cleared the "WARNING: infoROM is corrupted at gpu 0000:11:00.0" message.)
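
If it is a startup service, I suppose I could compare what is running right after a reboot with what is running once everything has come up, along these lines (DCGM and the host engine are just guesses at the culprit):

ps aux | egrep -i 'dcgm|nv-hostengine|nsys|nvprof'
systemctl list-units --type=service | grep -i nvidia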

I have the same problem. Has it been solved?

This type of issue could have many different causes. Can you share some more information about your system? What GPU, driver, and tool versions do you have? And can you share the command line and the error you run into? Also, the output of nvidia-smi may be informative.

A100, NVIDIA-SMI 470.103.01, Driver Version 470.103.01, CUDA Version 11.4.
I run ncu -o llama_7b --csv python examples/offline_inference.py to profile llama-7b
but got an error:


It seems the driver resource is unavailable, but I can't figure out what is occupying it. What can I do to solve this?

Can you run “nvidia-smi” and share the output? This may show what is using the profiling resources. Also, what version of Nsight Compute are you using? You can find this with “ncu --version”.