Profiling failed because a driver resource was unavailable

I have several working kernels that are providing OK speeds but need to be tuned. When I try to profile through the Nsight Compute GUI using “Start Activity…”, I get “Profiling failed because a driver resource was unavailable.”
So far, I have tried (with reboots between most steps):

  • Updating to the latest Compute (2026.1.0)
  • Updating to the latest driver for my GPU (595.97)
  • Ensuring that developer mode is enabled and that GPU performance counters are accessible to all users (set as an admin)
  • Ensuring that no other processes are running that might be accessing the profiling resources
  • Ensuring that no stray lock files exist
  • Running from the GUI and CLI with elevated privileges
  • Disabling HAGS
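
For anyone repeating the lock-file check, here is a minimal sketch in Python. It assumes the lock files live in the user's temp directory and match the `nsight-compute-lock*` pattern mentioned in the reply further down; the helper name `find_lock_files` is hypothetical and not part of any NVIDIA tool.

```python
import glob
import os
import tempfile


def find_lock_files(temp_dir=None, pattern="nsight-compute-lock*"):
    """Return sorted paths in temp_dir whose names match the lock-file pattern.

    Assumption: the pattern and location follow the path quoted in this
    thread; adjust both if your installation differs.
    """
    if temp_dir is None:
        temp_dir = tempfile.gettempdir()
    return sorted(glob.glob(os.path.join(temp_dir, pattern)))


if __name__ == "__main__":
    # Only lists candidates; deleting them is left to the user.
    for path in find_lock_files():
        print(path)
```

Running it before and after a profiling attempt shows whether a lock file was created and whether it was left behind on exit.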

I have CUDA Toolkit 13.1.1 installed at the moment. Running on Windows 11. Building with MSVC 19.50.35726.0 (Visual Studio Community Edition).

I appreciate any insights others have to offer.

I tried updating to CUDA 13.2.0. Same outcome.

Hi, @jimstack

Sorry you ran into this issue.
Your settings look correct.

  • Have you ever profiled successfully on this machine?
  • Can you please check whether any nsight-compute-lock files exist, such as C:\Users\${User}\AppData\Local\Temp\nsight-compute-lock*? If so, please delete them and try again.
  • Can you please provide the output of ncu $sample?
  • No, I have never used Nsight Compute on this (or any other) machine. I recently upgraded to this laptop. Nsight Compute refuses to install on my previous laptop (the installer hangs).
  • There are no lock files present in my temp directory before running Compute. It creates a lock file when I attempt to profile.
    • The previous version of Compute seemed to delete the lock file on exit in my first tests, but 2026.1.0 appears to leave it behind.
    • Deleting the lock file, re-launching Compute, and attempting to profile again does not affect anything.
  • Is there a particular sample you would like me to try? I tried the uncoalesced global accesses below.

double3 constant addition of 1048576 elements
kernelOption=0
==PROF== Connected to process 15772 (C:\Code\Nsight Samples\uncoalescedGlobalAccesses\uncoalescedGlobalAccesses.exe)
CUDA kernel addConstDouble3 launch with 4096 blocks of 256 threads
==PROF== Profiling “addConstDouble3” - 0: 0%…50%…100% - 9 passes
Done
==PROF== Disconnected from process 15772
[15772] uncoalescedGlobalAccesses.exe@127.0.0.1
addConstDouble3(int, double3 *, double, double3 *) (4096, 1, 1)x(256, 1, 1), Context 1, Stream 7, Device 0, CC 12.0
Section: GPU Speed Of Light Throughput
----------------------- ----------- ------------
Metric Name             Metric Unit Metric Value
----------------------- ----------- ------------
DRAM Frequency                  Ghz         8.99
SM Frequency                    Ghz         1.38
Elapsed Cycles                cycle      194,604
Memory Throughput                 %        81.43
DRAM Throughput                   %        81.43
Duration                         us       141.09
L1/TEX Cache Throughput           %        92.06
L2 Cache Throughput               %        71.95
SM Active Cycles              cycle   186,407.85
Compute (SM) Throughput           %        31.18
----------------------- ----------- ------------


INF   This workload is utilizing greater than 80.0% of the available compute or memory performance of this device.
      To further improve performance, work will likely need to be shifted from the most utilized to another unit.
      Start by analyzing DRAM in the Memory Workload Analysis section.

Section: Launch Statistics
-------------------------------- --------------- ---------------
Metric Name                          Metric Unit    Metric Value
-------------------------------- --------------- ---------------
Block Size                                                   256
Cluster Scheduling Policy                           PolicySpread
Cluster Size                                                   0
Function Cache Configuration                     CachePreferNone
Grid Size                                                  4,096
Preferred Cluster Size                                         0
Registers Per Thread             register/thread              18
Shared Memory Configuration Size           Kbyte           16.38
Driver Shared Memory Per Block       Kbyte/block            1.02
Dynamic Shared Memory Per Block       byte/block               0
Static Shared Memory Per Block        byte/block               0
# SMs                                         SM              26
Stack Size                                                 1,024
Threads                                   thread       1,048,576
# TPCs                                                        13
Enabled TPC IDs                                              all
Uses Green Context                                             0
Waves Per SM                                               26.26
-------------------------------- --------------- ---------------

Section: Occupancy
------------------------------- ----------- ------------
Metric Name                     Metric Unit Metric Value
------------------------------- ----------- ------------
Max Active Clusters                 cluster            0
Max Cluster Size                      block            8
Overall GPU Occupancy                     %            0
Cluster Occupancy                         %            0
Block Limit Barriers                  block           24
Block Limit SM                        block           24
Block Limit Registers                 block           10
Block Limit Shared Mem                block           16
Block Limit Warps                     block            6
Theoretical Active Warps per SM        warp           48
Theoretical Occupancy                     %          100
Achieved Occupancy                        %        78.55
Achieved Active Warps Per SM           warp        37.70
------------------------------- ----------- ------------

OPT   Est. Local Speedup: 21.45%
      The difference between calculated theoretical (100.0%) and measured achieved occupancy (78.5%) can be the
      result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can
      occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
      Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on
      optimizing occupancy.

Section: GPU and Memory Workload Distribution
-------------------------- ----------- ------------
Metric Name                Metric Unit Metric Value
-------------------------- ----------- ------------
Average DRAM Active Cycles       cycle    1,033,168
Total DRAM Elapsed Cycles        cycle    5,074,944
Average L1 Active Cycles         cycle   186,407.85
Total L1 Elapsed Cycles          cycle    5,044,264
Average L2 Active Cycles         cycle   196,851.69
Total L2 Elapsed Cycles          cycle    3,239,056
Average SM Active Cycles         cycle   186,407.85
Total SM Elapsed Cycles          cycle    5,044,264
Average SMSP Active Cycles       cycle   186,071.76
Total SMSP Elapsed Cycles        cycle   20,177,056
-------------------------- ----------- ------------
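
As a sanity check on the output above, several of the reported figures can be reproduced from the others with simple arithmetic. The sketch below is in Python and uses only values copied from the Launch Statistics and Occupancy sections. The warp size of 32 threads is standard for NVIDIA GPUs. The formulas are my reading of how the numbers relate, not a statement of how Nsight Compute computes them internally.

```python
# Values copied from the ncu output in this thread.
grid_size = 4096                 # Grid Size (blocks)
block_size = 256                 # Block Size (threads)
num_sms = 26                     # "# SMs"
theoretical_warps_per_sm = 48    # Theoretical Active Warps per SM
achieved_occupancy_pct = 78.55   # Achieved Occupancy
warp_size = 32                   # threads per warp on NVIDIA GPUs

# 256 threads per block is 8 warps, so 48 warps allow 6 resident blocks,
# which matches "Block Limit Warps".
warps_per_block = block_size // warp_size
blocks_per_sm = theoretical_warps_per_sm // warps_per_block

# One wave fills every SM with its resident blocks.
waves_per_sm = grid_size / (num_sms * blocks_per_sm)

# Achieved warps follow from the achieved occupancy percentage.
achieved_warps = theoretical_warps_per_sm * achieved_occupancy_pct / 100

# The gap to 100% theoretical occupancy equals the reported local speedup.
occupancy_gap_pct = 100 - achieved_occupancy_pct

print(f"blocks per SM:     {blocks_per_sm}")
print(f"waves per SM:      {waves_per_sm:.2f}")
print(f"achieved warps:    {achieved_warps:.2f}")
print(f"occupancy gap (%): {occupancy_gap_pct:.2f}")
```

The results line up with the reported Block Limit Warps (6), Waves Per SM (26.26), Achieved Active Warps Per SM (37.70), and Est. Local Speedup (21.45%).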

Is there a memory limit (or related issues) for Compute? I was able to get it to profile my kernels with a trivially small example. My original example occupies ~25% of GPU memory (2 GB). The trivial example is too small to tell me anything meaningful, so I will need to build something in between later.

This seems to have resolved my immediate block. Running with a smaller problem (less than 500 MB right now) allowed the profiling to complete.
I would still appreciate any insight from others on the nature of this issue, since I suspect I will want to increase the size of my profiled problem at some point.
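
To pick an intermediate problem size, a rough footprint estimate may help. The Python sketch below assumes the footprint is dominated by equally sized device buffers; the helper names are hypothetical. The 500 MB budget is only the size that happened to work in this thread, not a documented Nsight Compute limit. The two-buffer figure for the sample is inferred from the two `double3 *` arguments in the kernel signature.

```python
DOUBLE3_BYTES = 3 * 8  # three 8-byte doubles per element


def buffer_bytes(num_elements, bytes_per_element, num_buffers):
    """Total bytes for num_buffers equally sized device buffers."""
    return num_elements * bytes_per_element * num_buffers


def max_elements(budget_bytes, bytes_per_element, num_buffers):
    """Largest element count whose buffers fit within budget_bytes."""
    return budget_bytes // (bytes_per_element * num_buffers)


if __name__ == "__main__":
    # The sample that profiled successfully: 1,048,576 double3 elements,
    # assumed to use one input and one output buffer.
    sample = buffer_bytes(1_048_576, DOUBLE3_BYTES, 2)
    print(f"sample footprint: {sample / 2**20:.0f} MiB")

    # Element count that stays under the size observed to work here.
    budget = 500 * 1000**2  # 500 MB, an observation, not a known limit
    print(f"elements under budget: {max_elements(budget, DOUBLE3_BYTES, 2)}")
```

Stepping the budget up between runs would show roughly where profiling starts to fail on a given machine.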