- No, I have never used Nsight Compute on this (or any other) machine. I recently upgraded to this laptop. Nsight Compute refuses to install on my previous laptop (the installer hangs).
- There are no lock files present in my temp directory before running Compute. It creates a lock file when I attempt to profile.
- The previous version of Compute seemed to delete the lock file on exit in my first tests, but 2026.1.0 appears to leave it behind.
- Deleting the lock file, re-launching Compute, and attempting to profile again does not change the result.
- Is there a particular sample you would like me to try? I tried the uncoalescedGlobalAccesses sample; its output is below.
double3 constant addition of 1048576 elements
kernelOption=0
==PROF== Connected to process 15772 (C:\Code\Nsight Samples\uncoalescedGlobalAccesses\uncoalescedGlobalAccesses.exe)
CUDA kernel addConstDouble3 launch with 4096 blocks of 256 threads
==PROF== Profiling "addConstDouble3" - 0: 0%...50%...100% - 9 passes
Done
==PROF== Disconnected from process 15772
[15772] uncoalescedGlobalAccesses.exe@127.0.0.1
addConstDouble3(int, double3 *, double, double3 *) (4096, 1, 1)x(256, 1, 1), Context 1, Stream 7, Device 0, CC 12.0
Section: GPU Speed Of Light Throughput
----------------------- ----------- ------------
Metric Name             Metric Unit Metric Value
----------------------- ----------- ------------
DRAM Frequency                  Ghz         8.99
SM Frequency                    Ghz         1.38
Elapsed Cycles                cycle      194,604
Memory Throughput                 %        81.43
DRAM Throughput                   %        81.43
Duration                         us       141.09
L1/TEX Cache Throughput           %        92.06
L2 Cache Throughput               %        71.95
SM Active Cycles              cycle   186,407.85
Compute (SM) Throughput           %        31.18
----------------------- ----------- ------------
INF This workload is utilizing greater than 80.0% of the available compute or memory performance of this device.
To further improve performance, work will likely need to be shifted from the most utilized to another unit.
Start by analyzing DRAM in the Memory Workload Analysis section.
Section: Launch Statistics
-------------------------------- --------------- ---------------
Metric Name                          Metric Unit    Metric Value
-------------------------------- --------------- ---------------
Block Size                                                   256
Cluster Scheduling Policy                           PolicySpread
Cluster Size                                                   0
Function Cache Configuration                     CachePreferNone
Grid Size                                                  4,096
Preferred Cluster Size                                         0
Registers Per Thread             register/thread              18
Shared Memory Configuration Size           Kbyte           16.38
Driver Shared Memory Per Block       Kbyte/block            1.02
Dynamic Shared Memory Per Block       byte/block               0
Static Shared Memory Per Block        byte/block               0
# SMs                                         SM              26
Stack Size                                                 1,024
Threads                                   thread       1,048,576
# TPCs                                                        13
Enabled TPC IDs                                              all
Uses Green Context                                             0
Waves Per SM                                               26.26
-------------------------------- --------------- ---------------
Section: Occupancy
------------------------------- ----------- ------------
Metric Name                     Metric Unit Metric Value
------------------------------- ----------- ------------
Max Active Clusters                 cluster            0
Max Cluster Size                      block            8
Overall GPU Occupancy                     %            0
Cluster Occupancy                         %            0
Block Limit Barriers                  block           24
Block Limit SM                        block           24
Block Limit Registers                 block           10
Block Limit Shared Mem                block           16
Block Limit Warps                     block            6
Theoretical Active Warps per SM        warp           48
Theoretical Occupancy                     %          100
Achieved Occupancy                        %        78.55
Achieved Active Warps Per SM           warp        37.70
------------------------------- ----------- ------------
OPT Est. Local Speedup: 21.45%
The difference between calculated theoretical (100.0%) and measured achieved occupancy (78.5%) can be the
result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can
occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
Guide (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on
optimizing occupancy.
Section: GPU and Memory Workload Distribution
-------------------------- ----------- ------------
Metric Name                Metric Unit Metric Value
-------------------------- ----------- ------------
Average DRAM Active Cycles       cycle    1,033,168
Total DRAM Elapsed Cycles        cycle    5,074,944
Average L1 Active Cycles         cycle   186,407.85
Total L1 Elapsed Cycles          cycle    5,044,264
Average L2 Active Cycles         cycle   196,851.69
Total L2 Elapsed Cycles          cycle    3,239,056
Average SM Active Cycles         cycle   186,407.85
Total SM Elapsed Cycles          cycle    5,044,264
Average SMSP Active Cycles       cycle   186,071.76
Total SMSP Elapsed Cycles        cycle   20,177,056
-------------------------- ----------- ------------
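
As a sanity check that the report above is internally consistent, the derived metrics can be recomputed from the raw ones. This is a minimal sketch using only the numbers printed in the Launch Statistics and Occupancy sections. It assumes a warp size of 32 threads and that "Waves Per SM" is the grid size divided by the number of blocks that can be resident on the whole GPU at once; that is my reading of the report, not something I have confirmed in the Nsight Compute documentation.

```python
# Values copied from the Launch Statistics and Occupancy sections above.
grid_size = 4096              # Grid Size
block_size = 256              # Block Size
num_sms = 26                  # "# SMs"
block_limits = {              # "Block Limit ..." rows (blocks per SM)
    "barriers": 24,
    "sm": 24,
    "registers": 10,
    "shared_mem": 16,
    "warps": 6,
}
achieved_active_warps = 37.70  # Achieved Active Warps Per SM

WARP_SIZE = 32  # assumption: standard CUDA warp size

# Total threads launched.
threads = grid_size * block_size

# The tightest limit decides how many blocks fit on one SM.
blocks_per_sm = min(block_limits.values())

warps_per_block = block_size // WARP_SIZE
theoretical_warps = blocks_per_sm * warps_per_block

# Assumed definition: one "wave" is every SM filled with resident blocks.
waves_per_sm = grid_size / (num_sms * blocks_per_sm)

# Achieved occupancy relative to the theoretical warp count.
achieved_occupancy = 100 * achieved_active_warps / theoretical_warps

print(f"threads             = {threads:,}")
print(f"blocks per SM       = {blocks_per_sm}")
print(f"theoretical warps   = {theoretical_warps}")
print(f"waves per SM        = {waves_per_sm:.2f}")
print(f"achieved occupancy  = {achieved_occupancy:.2f} %")
```

Under these assumptions the thread count, theoretical warps per SM, and waves per SM match the report (1,048,576, 48, and 26.26). The recomputed achieved occupancy differs from the reported 78.55 only in the last digit, which is expected because 37.70 is itself a rounded value.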