Profiling failed because a driver resource was unavailable

dzdang · March 8, 2022, 2:52pm

I am having issues running ncu from the command line. I executed

ncu -o profile python test.py

from a linux terminal (test.py calls cudnn/cuda kernels), and it produces the output:

==PROF== Connected to process 839907 (/data/users/dzdang/miniconda3/envs/pytorch/bin/python3.9)

==ERROR== Profiling failed because a driver resource was unavailable. Ensure that no other tool (like DCGM) is concurrently collecting profiling data. See https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#faq for more details.
==ERROR== Failed to profile kernel "distribution_elementwise_grid..." in process 839907
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.

I read the FAQ page in the output and I tried

dcgmi profile --pause

but this didn’t resolve the issue.
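One way to check whether a DCGM-related daemon is still running after the pause is to look for it in the process list. The sketch below is only an illustration. The daemon names (nv-hostengine, dcgm-exporter) are assumptions about a typical DCGM install, and find_processes_by_name is a hypothetical helper, not part of any NVIDIA tool. It reads /proc/<pid>/comm, so it only works on Linux.

```python
import os

# Assumed names of DCGM-related daemons; adjust for your own system.
# Note: /proc/<pid>/comm truncates names to 15 characters.
SUSPECT_NAMES = ("nv-hostengine", "dcgm-exporter")


def find_processes_by_name(names=SUSPECT_NAMES, proc_root="/proc"):
    """Return sorted (pid, command name) pairs whose name contains one of `names`."""
    matches = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories correspond to processes
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                comm = f.read().strip()
        except OSError:
            continue  # the process exited or is not readable
        if any(name in comm for name in names):
            matches.append((int(entry), comm))
    return sorted(matches)


if __name__ == "__main__":
    for pid, comm in find_processes_by_name():
        print(pid, comm)
```

An empty result only means no process with one of the assumed names was found. Another tool could still be holding the profiling resource.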

The FAQ also stated that the issue could be due to "another instance of NVIDIA Nsight Compute without access to the same file system (see serialization for how this is prevented within the same file system)."

I located the nsight-compute-lock file as instructed, but it is empty. Is something supposed to be inside it?

These are the outputs of ncu --version and nvidia-smi, respectively:

NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2021 NVIDIA Corporation
Version 2021.3.0.0 (build 30414874) (public-release)

and

Tue Mar  8 06:41:46 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PG5...  On   | 00000000:11:00.0 Off |                    0 |
| N/A   28C    P0    49W / 330W |      3MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PG5...  On   | 00000000:12:00.0 Off |                    0 |
| N/A   30C    P0    51W / 330W |      3MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-PG5...  On   | 00000000:48:00.0 Off |                    0 |
| N/A   28C    P0    51W / 330W |      3MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-PG5...  On   | 00000000:49:00.0 Off |                    0 |
| N/A   30C    P0    52W / 330W |      3MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-PG5...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   28C    P0    49W / 330W |      3MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-PG5...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   29C    P0    49W / 330W |      3MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-PG5...  On   | 00000000:C6:00.0 Off |                    0 |
| N/A   28C    P0    52W / 330W |      3MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-PG5...  On   | 00000000:C9:00.0 Off |                    0 |
| N/A   29C    P0    51W / 330W |      3MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
WARNING: infoROM is corrupted at gpu 0000:11:00.0

Not sure if the WARNING: infoROM is corrupted at gpu 0000:11:00.0 has anything to do with it?

I also read Which application accesses the driver's performance monitor - #4 by felix_dt and Question about GPU Operator (DCGM) relation ship? and neither helped.

dzdang · March 8, 2022, 3:45pm

I did a sudo reboot, and ran the command again and it worked, but subsequent executions produced the same error. My guess is that when it was successful, the conflicting program/process wasn’t started up yet since I ran ncu shortly after the system rebooted and probably prior to all the startup processes launching. How can I go about debugging this issue?
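A possible starting point for debugging is to list which processes have the NVIDIA device files open, since nvidia-smi only reports processes with an active GPU context. The sketch below assumes a Linux /proc layout and device nodes under /dev/nvidia*. find_gpu_device_holders is a hypothetical helper name. Holding a device file open does not prove a process is using the profiling resource, so treat the result as a list of candidates only.

```python
import os


def find_gpu_device_holders(proc_root="/proc", prefix="/dev/nvidia"):
    """Map pid -> sorted list of open file targets that start with `prefix`."""
    holders = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "cpuinfo"
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # not permitted (run as root to see all) or process gone
        devices = set()
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor closed while we were scanning
            if target.startswith(prefix):
                devices.add(target)
        if devices:
            holders[int(entry)] = sorted(devices)
    return holders


if __name__ == "__main__":
    for pid, devices in sorted(find_gpu_device_holders().items()):
        print(pid, ", ".join(devices))
```

Running it once right after a reboot (when ncu works) and again later (when it fails) and comparing the two lists may narrow down which startup process is involved.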

(Also the reboot removed the WARNING: infoROM is corrupted at gpu 0000:11:00.0 issue)

1129973724 · March 14, 2023, 5:42am

I have the same problem. Has it been solved?

jmarusarz · March 16, 2023, 7:38pm

This type of issue could have many different causes. Can you share some more information about your system? What GPU, driver, and tools versions do you have? And can you share the CLI and error you run into? Also, the output of nvidia-smi may be informative.

liufei3 · August 15, 2023, 6:40am

A100 and NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.4
I ran ncu -o llama_7b --csv python examples/offline_inference.py to profile llama-7b
but got an error:

It seems the driver resource is unavailable, but I can’t figure out what is occupying the driver. What can I do to solve it?

jmarusarz · August 17, 2023, 8:18pm

Can you run “nvidia-smi” and share the output? This may show what is using the profiling resources. Also, what version of Nsight Compute are you using? You can find this with “ncu --version”.
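For anyone gathering the requested information, a small script can capture both outputs in one report to paste into a reply. This is a minimal sketch: run_command and collect_report are hypothetical helper names, and the only external commands assumed are the two mentioned above (nvidia-smi and ncu --version).

```python
import subprocess


def run_command(cmd):
    """Run `cmd` (a list of strings) and return its combined stdout/stderr as text."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            timeout=60,
        )
    except FileNotFoundError:
        return f"{cmd[0]}: command not found"
    except subprocess.TimeoutExpired:
        return f"{cmd[0]}: timed out"
    return result.stdout


def collect_report(commands):
    """Return one text report with each command line followed by its output."""
    sections = []
    for cmd in commands:
        sections.append("$ " + " ".join(cmd))
        sections.append(run_command(cmd).rstrip())
    return "\n".join(sections)


if __name__ == "__main__":
    print(collect_report([["nvidia-smi"], ["ncu", "--version"]]))
```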
