I have a CUDA kernel running slowly on a Jetson AGX Orin, and its runtime varies widely between runs: sometimes it takes 90 ms, sometimes 60 ms, 30 ms, or 20 ms. I can't find the reason for this variation. Are there any markers or events that I can add to the CUDA kernel to measure how long certain parts of the kernel take?
I use the following host-side code to measure the time (launch configuration and kernel arguments elided here):
auto start = std::chrono::high_resolution_clock::now();
kernel_call<<<grid, block>>>(/* args */);
cudaDeviceSynchronize();  // block until the kernel finishes before taking the stop time
auto stop = std::chrono::high_resolution_clock::now();
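I'm aware I could bracket the launch with CUDA events instead; a minimal sketch is below (ev_start/ev_stop and the placeholder launch are illustrative, not my real code). As far as I can tell, though, this still only measures the whole kernel, not how long individual parts of it take:
cudaEvent_t ev_start, ev_stop;
cudaEventCreate(&ev_start);
cudaEventCreate(&ev_stop);
cudaEventRecord(ev_start);                     // marker before the kernel
kernel_call<<<grid, block>>>(/* args */);
cudaEventRecord(ev_stop);                      // marker after the kernel
cudaEventSynchronize(ev_stop);                 // wait until the stop event completes
float ms = 0.0f;
cudaEventElapsedTime(&ms, ev_start, ev_stop);  // GPU-side elapsed time in milliseconds
cudaEventDestroy(ev_start);
cudaEventDestroy(ev_stop);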
When I measure the time from inside the kernel using clock64() and divide by the clock rate, the resulting values don't match the timings measured outside the kernel.
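The in-kernel measurement looks roughly like this (simplified sketch; cycles_out is a hypothetical device buffer that I copy back to the host and divide by the SM clock rate):
__global__ void kernel_call(/* args, */ long long* cycles_out)
{
    long long t0 = clock64();              // per-SM cycle counter at the start of the section
    // ... the part of the kernel I want to time ...
    long long t1 = clock64();
    if (threadIdx.x == 0)
        cycles_out[blockIdx.x] = t1 - t0;  // cycles this block spent in the section
}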
Below is the output from the Nsight Compute CLI. From this output, I'm unsure which part of the code I should optimize.
Section: GPU Speed Of Light Throughput
---------------------------------------------------------------------- --------------- ------------------------------
SM Frequency cycle/nsecond 1.14
Elapsed Cycles cycle 10,793,953
Memory [%] % 90.63
Duration msecond 9.42
L1/TEX Cache Throughput % 91.59
L2 Cache Throughput % 3.08
SM Active Cycles cycle 10,621,348.44
Compute (SM) [%] % 40.31
---------------------------------------------------------------------- --------------- ------------------------------
INF The kernel is utilizing greater than 80.0% of the available compute or memory performance of the device. To
further improve performance, work will likely need to be shifted from the most utilized to another unit.
Start by analyzing L1 in the Memory Workload Analysis section.
INF The ratio of peak float (fp32) to double (fp64) performance on this device is 64:1. The kernel achieved
close to 0% of this device's fp32 peak performance and 0% of its fp64 peak performance. See the Kernel
Profiling Guide (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#roofline) for more details
on roofline analysis.
Section: Compute Workload Analysis
---------------------------------------------------------------------- --------------- ------------------------------
Executed Ipc Active inst/cycle 1.06
Executed Ipc Elapsed inst/cycle 1.05
Issue Slots Busy % 26.56
Issued Ipc Active inst/cycle 1.06
SM Busy % 26.56
---------------------------------------------------------------------- --------------- ------------------------------
No compute pipeline is over-utilized.
Section: Memory Workload Analysis
---------------------------------------------------------------------- --------------- ------------------------------
Mem Busy % 90.63
Max Bandwidth % 40.31
L1/TEX Hit Rate % 79.29
L2 Compression Success Rate % 0
L2 Compression Ratio 0
L2 Hit Rate % 98.58
Mem Pipes Busy % 40.31
---------------------------------------------------------------------- --------------- ------------------------------
WRN The memory access pattern for global stores in L1TEX might not be optimal. On average, this kernel accesses
3.4 bytes per thread per memory request; but the address pattern, possibly caused by the stride between
threads, results in 6.0 sectors per request, or 6.0*32 = 191.3 bytes of cache data transfers per request.
The optimal thread address pattern for 3.4 byte accesses would result in 3.4*32 = 109.6 bytes of cache data
transfers per request, to maximize L1TEX cache performance. Check the Source Counters section for
uncoalesced global stores.
----- --------------------------------------------------------------------------------------------------------------
WRN The memory access pattern for global loads in L1TEX might not be optimal. On average, this kernel accesses
3.4 bytes per thread per memory request; but the address pattern, possibly caused by the stride between
threads, results in 6.8 sectors per request, or 6.8*32 = 216.3 bytes of cache data transfers per request.
The optimal thread address pattern for 3.4 byte accesses would result in 3.4*32 = 109.6 bytes of cache data
transfers per request, to maximize L1TEX cache performance. Check the Source Counters section for
uncoalesced global loads.
----- --------------------------------------------------------------------------------------------------------------
WRN The memory access pattern for loads from L1TEX to L2 is not optimal. The granularity of an L1TEX request to
L2 is a 128 byte cache line. That is 4 consecutive 32-byte sectors per L2 request. However, this kernel only
accesses an average of 1.3 sectors out of the possible 4 sectors per cache line. Check the Source Counters
section for uncoalesced loads and try to minimize how many cache lines need to be accessed per memory
request.
----- --------------------------------------------------------------------------------------------------------------
WRN The memory access pattern for stores from L1TEX to L2 is not optimal. The granularity of an L1TEX request to
L2 is a 128 byte cache line. That is 4 consecutive 32-byte sectors per L2 request. However, this kernel only
accesses an average of 2.8 sectors out of the possible 4 sectors per cache line. Check the Source Counters
section for uncoalesced stores and try to minimize how many cache lines need to be accessed per memory
request.
Section: Scheduler Statistics
---------------------------------------------------------------------- --------------- ------------------------------
One or More Eligible % 26.57
Issued Warp Per Scheduler 0.27
No Eligible % 73.43
Active Warps Per Scheduler warp 7.25
Eligible Warps Per Scheduler warp 0.94
---------------------------------------------------------------------- --------------- ------------------------------
WRN Every scheduler is capable of issuing one instruction per cycle, but for this kernel each scheduler only
issues an instruction every 3.8 cycles. This might leave hardware resources underutilized and may lead to
less optimal performance. Out of the maximum of 12 warps per scheduler, this kernel allocates an average of
7.25 active warps per scheduler, but only an average of 0.94 warps were eligible per cycle. Eligible warps
are the subset of active warps that are ready to issue their next instruction. Every cycle with no eligible
warp results in no instruction being issued and the issue slot remains unused. To increase the number of
eligible warps, avoid possible load imbalances due to highly different execution durations per warp.
Reducing stalls indicated on the Warp State Statistics and Source Counters sections can help, too.
----- --------------------------------------------------------------------------------------------------------------
WRN The 7.25 theoretical warps per scheduler this kernel can issue according to its occupancy are below the
hardware maximum of 12. Use the Occupancy section to identify what limits this kernel's theoretical
occupancy.
Section: Warp State Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Warp Cycles Per Issued Instruction cycle 27.29
Warp Cycles Per Executed Instruction cycle 27.29
Avg. Active Threads Per Warp 28.71
Avg. Not Predicated Off Threads Per Warp 28.08
---------------------------------------------------------------------- --------------- ------------------------------
WRN On average, each warp of this kernel spends 18.1 cycles being stalled waiting for an MIO instruction queue to
be not full. This represents about 66.3% of the total average of 27.3 cycles between issuing two
instructions. This stall reason is high in cases of utilization of the MIO pipelines, which include special
math instructions, dynamic branches, as well as shared memory instructions. When caused by shared memory
accesses, trying to use fewer but wider loads can reduce pipeline pressure.
----- --------------------------------------------------------------------------------------------------------------
INF Check the Source Counters section for the top stall locations in your source based on sampling data. The
Kernel Profiling Guide (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#sampling) provides
more details on each stall reason.
Section: Instruction Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Avg. Executed Instructions Per Scheduler inst 2,821,237.75
Executed Instructions inst 180,559,216
Avg. Issued Instructions Per Scheduler inst 2,821,316.86
Issued Instructions inst 180,564,279
---------------------------------------------------------------------- --------------- ------------------------------
WRN This kernel executes 464000 fused and 2784000 non-fused FP32 instructions. By converting pairs of non-fused
instructions to their fused (https://docs.nvidia.com/cuda/floating-point/#cuda-and-floating-point),
higher-throughput equivalent, the achieved FP32 performance could be increased by up to 43% (relative to its
current performance). Check the Source page to identify where this kernel executes FP32 instructions.
Section: Launch Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Block Size 905
Function Cache Configuration cudaFuncCachePreferNone
Grid Size 16
Registers Per Thread register/thread 40
Shared Memory Configuration Size Kbyte 65.54
Driver Shared Memory Per Block Kbyte/block 1.02
Dynamic Shared Memory Per Block byte/block 0
Static Shared Memory Per Block Kbyte/block 35.22
Threads thread 14,480
Waves Per SM 1
---------------------------------------------------------------------- --------------- ------------------------------
WRN Threads are executed in groups of 32 threads called warps. This kernel launch is configured to execute 905
threads per block. Consequently, some threads in a warp are masked off and those hardware resources are
unused. Try changing the number of threads per block to be a multiple of 32 threads. Between 128 and 256
threads per block is a good initial range for experimentation. Use smaller thread blocks rather than one
large thread block per multiprocessor if latency affects performance. This is particularly beneficial to
kernels that frequently call __syncthreads(). See the Hardware Model
(https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more
details on launch configurations.
----- --------------------------------------------------------------------------------------------------------------
WRN If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the
achieved 1 block per multiprocessor. This way, blocks that aren't waiting for __syncthreads() can keep the
hardware busy.
Section: Occupancy
---------------------------------------------------------------------- --------------- ------------------------------
Block Limit SM block 16
Block Limit Registers block 1
Block Limit Shared Mem block 1
Block Limit Warps block 1
Theoretical Active Warps per SM warp 29
Theoretical Occupancy % 60.42
Achieved Occupancy % 60.42
Achieved Active Warps Per SM warp 29.00
---------------------------------------------------------------------- --------------- ------------------------------
WRN This kernel's theoretical occupancy (60.4%) is limited by the required amount of shared memory, by the
number of warps within each block, and by the number of required registers. See the CUDA Best Practices Guide
(https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on
optimizing occupancy.
Section: Source Counters
---------------------------------------------------------------------- --------------- ------------------------------
Branch Instructions Ratio % 0.01
Branch Instructions inst 1,722,928
Branch Efficiency % 93.18
Avg. Divergent Branches 1,251
---------------------------------------------------------------------- --------------- ------------------------------
WRN This kernel has uncoalesced global accesses resulting in a total of 791680 excessive sectors (18% of the
total 4366496 sectors). Check the L2 Theoretical Sectors Global Excessive table for the primary source
locations. The CUDA Programming Guide
(https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses) has additional
information on reducing uncoalesced device memory accesses.
----- --------------------------------------------------------------------------------------------------------------
WRN This kernel has uncoalesced shared accesses resulting in a total of 85712000 excessive wavefronts (56% of the
total 153524192 wavefronts). Check the L1 Wavefronts Shared Excessive table for the primary source
locations. The CUDA Best Practices Guide
(https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#shared-memory-in-matrix-multiplication-c-aa)
has an example on optimizing shared memory accesses.
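From the uncoalesced-access warnings, my understanding is that the stride between threads is what inflates the sectors per request. As a sanity check on my reading of the report, is the problem essentially the difference between these two patterns (a contrived sketch, not my actual kernel)?
// Strided: thread i touches element i*stride, so one warp's 32 accesses
// spread across many 32-byte sectors (more sectors per request).
__global__ void strided_copy(float* out, const float* in, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i * stride] = in[i * stride];
}

// Coalesced: consecutive threads touch consecutive elements, so one warp's
// accesses fall into the minimum number of sectors.
__global__ void coalesced_copy(float* out, const float* in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}
And per the launch-statistics warning, I assume I should also change the 905 threads per block to a multiple of 32 (e.g., 896 or 928)?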