Variable run time for CUDA kernel

I have a CUDA kernel that runs slowly on a Jetson AGX Orin, and its run time varies: sometimes it takes 90 ms, sometimes 60 ms, 30 ms, or 20 ms. I can't find the reason for this. Are there any markers or events I can add to the CUDA kernel to measure how long certain parts of it take?

I use the following code to measure the time:

auto start = std::chrono::high_resolution_clock::now();
kernel_call<<<...>>>(...);
cudaDeviceSynchronize();
auto stop = std::chrono::high_resolution_clock::now();

When I measure the time from inside the kernel using clock64() and divide it by the clock rate, the values I compute don't match the timings measured outside the kernel.
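
Roughly, the in-kernel timing looks like this (a simplified sketch; the kernel and the region being timed are placeholders):

// Simplified sketch of in-kernel region timing with clock64().
// timed_kernel and the timed region are placeholders, not the real code.
__global__ void timed_kernel(float *data, long long *cycles, int n)
{
    long long t0 = clock64();                 // SM cycle counter at region start

    for (int i = threadIdx.x; i < n; i += blockDim.x)
        data[i] *= 2.0f;                      // region being timed

    long long t1 = clock64();                 // SM cycle counter at region end
    if (threadIdx.x == 0)
        cycles[blockIdx.x] = t1 - t0;         // one cycle count per block
}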

Below is the output from the Nsight Compute CLI. From this output, I'm unsure which part of the code I should optimize.

Section: GPU Speed Of Light Throughput
---------------------------------------------------------------------- --------------- ------------------------------
SM Frequency                                                                 cycle/nsecond                           1.14
Elapsed Cycles                                                                       cycle                     10,793,953
Memory [%]                                                                               %                          90.63
Duration                                                                           msecond                           9.42
L1/TEX Cache Throughput                                                                  %                          91.59
L2 Cache Throughput                                                                      %                           3.08
SM Active Cycles                                                                     cycle                  10,621,348.44
Compute (SM) [%]                                                                         %                          40.31
---------------------------------------------------------------------- --------------- ------------------------------
INF   The kernel is utilizing greater than 80.0% of the available compute or memory performance of the device. To
      further improve performance, work will likely need to be shifted from the most utilized to another unit.
      Start by analyzing L1 in the Memory Workload Analysis section.

INF   The ratio of peak float (fp32) to double (fp64) performance on this device is 64:1. The kernel achieved       
      close to 0% of this device's fp32 peak performance and 0% of its fp64 peak performance. See the Kernel        
      Profiling Guide (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#roofline) for more details  
      on roofline analysis.                                                                                         

Section: Compute Workload Analysis
---------------------------------------------------------------------- --------------- ------------------------------
Executed Ipc Active                                                         inst/cycle                           1.06
Executed Ipc Elapsed                                                        inst/cycle                           1.05
Issue Slots Busy                                                                     %                          26.56
Issued Ipc Active                                                           inst/cycle                           1.06
SM Busy                                                                              %                          26.56
---------------------------------------------------------------------- --------------- ------------------------------
      No compute pipeline is over-utilized.                                                                         

Section: Memory Workload Analysis
---------------------------------------------------------------------- --------------- ------------------------------
Mem Busy                                                                             %                          90.63
Max Bandwidth                                                                        %                          40.31
L1/TEX Hit Rate                                                                      %                          79.29
L2 Compression Success Rate                                                          %                              0
L2 Compression Ratio                                                                                                0
L2 Hit Rate                                                                          %                          98.58
Mem Pipes Busy                                                                       %                          40.31
---------------------------------------------------------------------- --------------- ------------------------------

WRN   The memory access pattern for global stores in L1TEX might not be optimal. On average, this kernel accesses   
      3.4 bytes per thread per memory request; but the address pattern, possibly caused by the stride between       
      threads, results in 6.0 sectors per request, or 6.0*32 = 191.3 bytes of cache data transfers per request.     
      The optimal thread address pattern for 3.4 byte accesses would result in 3.4*32 = 109.6 bytes of cache data   
      transfers per request, to maximize L1TEX cache performance. Check the Source Counters section for             
      uncoalesced global stores.                                                                                    
----- --------------------------------------------------------------------------------------------------------------
WRN   The memory access pattern for global loads in L1TEX might not be optimal. On average, this kernel accesses    
      3.4 bytes per thread per memory request; but the address pattern, possibly caused by the stride between       
      threads, results in 6.8 sectors per request, or 6.8*32 = 216.3 bytes of cache data transfers per request.     
      The optimal thread address pattern for 3.4 byte accesses would result in 3.4*32 = 109.6 bytes of cache data   
      transfers per request, to maximize L1TEX cache performance. Check the Source Counters section for             
      uncoalesced global loads.                                                                                     
----- --------------------------------------------------------------------------------------------------------------
WRN   The memory access pattern for loads from L1TEX to L2 is not optimal. The granularity of an L1TEX request to   
      L2 is a 128 byte cache line. That is 4 consecutive 32-byte sectors per L2 request. However, this kernel only  
      accesses an average of 1.3 sectors out of the possible 4 sectors per cache line. Check the Source Counters    
      section for uncoalesced loads and try to minimize how many cache lines need to be accessed per memory         
      request.                                                                                                      
----- --------------------------------------------------------------------------------------------------------------
WRN   The memory access pattern for stores from L1TEX to L2 is not optimal. The granularity of an L1TEX request to  
      L2 is a 128 byte cache line. That is 4 consecutive 32-byte sectors per L2 request. However, this kernel only  
      accesses an average of 2.8 sectors out of the possible 4 sectors per cache line. Check the Source Counters    
      section for uncoalesced stores and try to minimize how many cache lines need to be accessed per memory        
      request.                                                                                                      

Section: Scheduler Statistics
---------------------------------------------------------------------- --------------- ------------------------------
One or More Eligible                                                                 %                          26.57
Issued Warp Per Scheduler                                                                                        0.27
No Eligible                                                                          %                          73.43
Active Warps Per Scheduler                                                        warp                           7.25
Eligible Warps Per Scheduler                                                      warp                           0.94
---------------------------------------------------------------------- --------------- ------------------------------
WRN   Every scheduler is capable of issuing one instruction per cycle, but for this kernel each scheduler only      
      issues an instruction every 3.8 cycles. This might leave hardware resources underutilized and may lead to     
      less optimal performance. Out of the maximum of 12 warps per scheduler, this kernel allocates an average of   
      7.25 active warps per scheduler, but only an average of 0.94 warps were eligible per cycle. Eligible warps    
      are the subset of active warps that are ready to issue their next instruction. Every cycle with no eligible   
      warp results in no instruction being issued and the issue slot remains unused. To increase the number of      
      eligible warps, avoid possible load imbalances due to highly different execution durations per warp.          
      Reducing stalls indicated on the Warp State Statistics and Source Counters sections can help, too.            
----- --------------------------------------------------------------------------------------------------------------
WRN   The 7.25 theoretical warps per scheduler this kernel can issue according to its occupancy are below the       
      hardware maximum of 12. Use the Occupancy section to identify what limits this kernel's theoretical           
      occupancy.                                                                                                    

Section: Warp State Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Warp Cycles Per Issued Instruction                                               cycle                          27.29
Warp Cycles Per Executed Instruction                                             cycle                          27.29
Avg. Active Threads Per Warp                                                                                    28.71
Avg. Not Predicated Off Threads Per Warp                                                                        28.08
---------------------------------------------------------------------- --------------- ------------------------------
WRN   On average, each warp of this kernel spends 18.1 cycles being stalled waiting for an MIO instruction queue to 
      be not full. This represents about 66.3% of the total average of 27.3 cycles between issuing two              
      instructions. This stall reason is high in cases of utilization of the MIO pipelines, which include special   
      math instructions, dynamic branches, as well as shared memory instructions. When caused by shared memory      
      accesses, trying to use fewer but wider loads can reduce pipeline pressure.                                   
----- --------------------------------------------------------------------------------------------------------------
INF   Check the Source Counters section for the top stall locations in your source based on sampling data. The      
      Kernel Profiling Guide (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#sampling) provides   
      more details on each stall reason.                                                                            

Section: Instruction Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Avg. Executed Instructions Per Scheduler                                          inst                   2,821,237.75
Executed Instructions                                                             inst                    180,559,216
Avg. Issued Instructions Per Scheduler                                            inst                   2,821,316.86
Issued Instructions                                                               inst                    180,564,279
---------------------------------------------------------------------- --------------- ------------------------------
WRN   This kernel executes 464000 fused and 2784000 non-fused FP32 instructions. By converting pairs of non-fused   
      instructions to their fused (https://docs.nvidia.com/cuda/floating-point/#cuda-and-floating-point),           
      higher-throughput equivalent, the achieved FP32 performance could be increased by up to 43% (relative to its  
      current performance). Check the Source page to identify where this kernel executes FP32 instructions.         

Section: Launch Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Block Size                                                                                                        905
Function Cache Configuration                                                                  cudaFuncCachePreferNone
Grid Size                                                                                                          16
Registers Per Thread                                                   register/thread                             40
Shared Memory Configuration Size                                                 Kbyte                          65.54
Driver Shared Memory Per Block                                             Kbyte/block                           1.02
Dynamic Shared Memory Per Block                                             byte/block                              0
Static Shared Memory Per Block                                             Kbyte/block                          35.22
Threads                                                                         thread                         14,480
Waves Per SM                                                                                                        1
---------------------------------------------------------------------- --------------- ------------------------------
WRN   Threads are executed in groups of 32 threads called warps. This kernel launch is configured to execute 905    
      threads per block. Consequently, some threads in a warp are masked off and those hardware resources are       
      unused. Try changing the number of threads per block to be a multiple of 32 threads. Between 128 and 256      
      threads per block is a good initial range for experimentation. Use smaller thread blocks rather than one      
      large thread block per multiprocessor if latency affects performance.  This is particularly beneficial to     
      kernels that frequently call __syncthreads(). See the Hardware Model                                          
      (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-hw-model) description for more      
      details on launch configurations.                                                                             
----- --------------------------------------------------------------------------------------------------------------
WRN   If you execute __syncthreads() to synchronize the threads of a block, it is recommended to have more than the 
      achieved 1 blocks per multiprocessor. This way, blocks that aren't waiting for __syncthreads() can keep the   
      hardware busy.                                                                                                

Section: Occupancy
---------------------------------------------------------------------- --------------- ------------------------------
Block Limit SM                                                                   block                             16
Block Limit Registers                                                            block                              1
Block Limit Shared Mem                                                           block                              1
Block Limit Warps                                                                block                              1
Theoretical Active Warps per SM                                                   warp                             29
Theoretical Occupancy                                                                %                          60.42
Achieved Occupancy                                                                   %                          60.42
Achieved Active Warps Per SM                                                      warp                          29.00
---------------------------------------------------------------------- --------------- ------------------------------
WRN   This kernel's theoretical occupancy (60.4%) is limited by the required amount of shared memory. This kernel's
      theoretical occupancy (60.4%) is limited by the number of warps within each block. This kernel's theoretical
      occupancy (60.4%) is limited by the number of required registers. See the CUDA Best Practices Guide
      (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy) for more details on
      optimizing occupancy.

Section: Source Counters
---------------------------------------------------------------------- --------------- ------------------------------
Branch Instructions Ratio                                                            %                           0.01
Branch Instructions                                                               inst                      1,722,928
Branch Efficiency                                                                    %                          93.18
Avg. Divergent Branches                                                                                         1,251
---------------------------------------------------------------------- --------------- ------------------------------
WRN   This kernel has uncoalesced global accesses resulting in a total of 791680 excessive sectors (18% of the      
      total 4366496 sectors). Check the L2 Theoretical Sectors Global Excessive table for the primary source        
      locations. The CUDA Programming Guide                                                                         
      (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses) has additional
      information on reducing uncoalesced device memory accesses.                                                   
----- --------------------------------------------------------------------------------------------------------------
WRN   This kernel has uncoalesced shared accesses resulting in a total of 85712000 excessive wavefronts (56% of the 
      total 153524192 wavefronts). Check the L1 Wavefronts Shared Excessive table for the primary source            
      locations. The CUDA Best Practices Guide                                                                      
      (https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#shared-memory-in-matrix-multiplication-c-aa)
      has an example on optimizing shared memory accesses.

Hi,

Have you locked the GPU to its maximum frequency before profiling?
Jetson uses dynamic frequency scaling by default, so performance will vary between runs.

You can lock the GPU clock to the maximum with the following commands:

$ sudo nvpmodel -m 0   # can be another power mode if there is a power budget
$ sudo jetson_clocks

CUDA events can be used to measure the duration of kernel code.
Below is an example for your reference:
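
A minimal sketch using cudaEventRecord / cudaEventElapsedTime (the kernel and launch configuration below are placeholders; substitute your own):

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; replace with your own kernel and launch configuration.
__global__ void kernel_call() {}

int main()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                  // record start marker on the default stream
    kernel_call<<<16, 256>>>();              // placeholder launch configuration
    cudaEventRecord(stop);                   // record stop marker after the kernel

    cudaEventSynchronize(stop);              // wait for the kernel and the stop event to finish
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time between the two events, in ms
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}

Unlike std::chrono around cudaDeviceSynchronize(), the events measure only the GPU execution time between the two markers.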

Thanks.

Hi,

This works. The run time is significantly reduced and is now consistent.
