Improving 'Stall Long Scoreboard' with warp-level communication


I am writing a CUDA kernel for research. The kernel has many inputs from device memory and does not use shared memory. Profiling shows problems with “Issue Slot Utilization” and “Stall Long Scoreboard”. The profiler log is:

kernel(int, int, float, int const*, int const*, int const*, int const*, int const*, float const*, float const*, float const*, float const*, float const*, float const*, float*, float*, float*, float*, float*), 2021-Oct-21 22:14:48, Context 1, Stream 7
Section: Memory Workload Analysis
---------------------------------------------------------------------- --------------- ------------------------------
Memory Throughput                                                         Gbyte/second                         582.21
Mem Busy                                                                             %                          21.56
Max Bandwidth                                                                        %                          62.40
L1/TEX Hit Rate                                                                      %                           1.51
L2 Hit Rate                                                                          %                          24.07
Mem Pipes Busy                                                                       %                           1.68
---------------------------------------------------------------------- --------------- ------------------------------

Section: Scheduler Statistics
---------------------------------------------------------------------- --------------- ------------------------------
One or More Eligible                                                                 %                           3.26
Issued Warp Per Scheduler                                                                                        0.03
No Eligible                                                                          %                          96.74
Active Warps Per Scheduler                                                        warp                          14.00
Eligible Warps Per Scheduler                                                      warp                           0.05
---------------------------------------------------------------------- --------------- ------------------------------
WRN   Every scheduler is capable of issuing one instruction per cycle, but for this kernel each scheduler only      
      issues an instruction every 30.6 cycles. This might leave hardware resources underutilized and may lead to    
      less optimal performance. Out of the maximum of 16 warps per scheduler, this kernel allocates an average of   
      14.00 active warps per scheduler, but only an average of 0.05 warps were eligible per cycle. Eligible warps   
      are the subset of active warps that are ready to issue their next instruction. Every cycle with no eligible   
      warp results in no instruction being issued and the issue slot remains unused. To increase the number of      
      eligible warps either increase the number of active warps or reduce the time the active warps are stalled.    

Section: Warp State Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Warp Cycles Per Issued Instruction                                               cycle                         428.88
Warp Cycles Per Executed Instruction                                             cycle                         435.14
Avg. Active Threads Per Warp                                                                                    31.88
Avg. Not Predicated Off Threads Per Warp                                                                        26.59
---------------------------------------------------------------------- --------------- ------------------------------
WRN   On average each warp of this kernel spends 306.1 cycles being stalled waiting for a scoreboard dependency on  
      a L1TEX (local, global, surface, texture) operation. This represents about 71.4% of the total average of      
      428.9 cycles between issuing two instructions. To reduce the number of cycles waiting on L1TEX data accesses  
      verify the memory access patterns are optimal for the target architecture, attempt to increase cache hit      
      rates by increasing data locality or by changing the cache configuration, and consider moving frequently      
      used data to shared memory.                                                                                   

In my algorithm, multiple consecutive threads read from the same address in device memory (although in a coalesced manner), which can still burden the L1 cache. Do you think sharing values through registers within a warp, instead of re-reading them from device memory, could bring any improvement?

Thanks a lot!

The kernel appears to be latency bound, not L1 bound. You can verify this by checking the l1tex__throughput breakdown; my expectation is that l1tex__throughput.avg.pct_of_peak_sustained_active is fairly low. The low L1 and L2 hit rates indicate that most load operations reach DRAM, which the 400+ cycle average instruction latency also suggests.

In order to improve this you can:
a. batch loads so the warp does not serially wait on multiple loads,
b. reduce the amount of data read, e.g. by compressing it, and
c. exploit any potential for data re-use between warps in the same thread block.
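For (c), a minimal sketch of block-level re-use (hypothetical example: a small table `lut` that all warps of a block read repeatedly; names and sizes are made up, this is not your kernel):

```cuda
#define TABLE_SIZE 256  // hypothetical size of the re-used table

__global__ void staged_kernel(const float* __restrict__ lut,
                              const int* __restrict__ keys,
                              float* out, int n)
{
    // Stage the frequently re-used table in shared memory once per block,
    // so repeated reads are served on-chip instead of from L2/DRAM.
    __shared__ float s_lut[TABLE_SIZE];
    for (int i = threadIdx.x; i < TABLE_SIZE; i += blockDim.x)
        s_lut[i] = lut[i];
    __syncthreads();

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = s_lut[keys[tid] % TABLE_SIZE];
}
```

This only pays off when the same data really is touched many times per block; a single staging pass for data read once just adds a synchronization.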

Hi Greg,

Thanks for your insights. Could you please clarify the following:

  1. Batch loads: Do you mean combining elements into an array of structs and loading the whole struct at once?

  2. Data re-use: I am actually doing a key-based min reduction and currently use atomics for it. Can you point me to an example I could follow for writing a key-based reduction, as opposed to the usual reduction through warp aggregation? I see some examples using CUB, but I think they require dynamic parallelism. Do you think a key-based reduction is feasible with just warp communication? (For simplicity, let's also assume that the key is the index where the output should be written.)
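     For reference, my current update is roughly the following (sketch only; since CUDA has no native float atomicMin I emulate it with a CAS loop, and `out` is initialized to FLT_MAX beforehand):

```cuda
// Emulated float atomicMin via compare-and-swap.
__device__ void atomicMinFloat(float* addr, float value)
{
    int* addr_as_int = (int*)addr;
    int old = *addr_as_int;
    while (value < __int_as_float(old)) {
        int assumed = old;
        old = atomicCAS(addr_as_int, assumed, __float_as_int(value));
        if (old == assumed) break;  // our value was installed
    }
}

// Every thread pushes its (key, val) pair straight to global memory.
__device__ void keyedMin(int key, float val, float* out)
{
    atomicMinFloat(&out[key], val);
}
```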

Thanks again!

Batching memory operations means trying to issue multiple loads at the same time, as opposed to interleaving compute and memory operations, so that the load latencies overlap instead of being paid one after another.
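A minimal illustration of the difference (hypothetical arrays `a` and `b`, not your kernel):

```cuda
// Interleaved: each iteration's FMA depends on its own loads, so if the
// compiler cannot reorder them the ~400-cycle latencies are paid back-to-back.
for (int k = 0; k < 4; ++k)
    acc += a[i + k] * b[i + k];

// Batched: issue all independent loads first so their latencies overlap,
// then do the arithmetic on the already-fetched registers.
float a0 = a[i],   a1 = a[i+1], a2 = a[i+2], a3 = a[i+3];
float b0 = b[i],   b1 = b[i+1], b2 = b[i+2], b3 = b[i+3];
acc += a0*b0 + a1*b1 + a2*b2 + a3*b3;
```

In practice the compiler often does this itself after unrolling, but dependent address calculations or potential pointer aliasing can prevent it; marking the input pointers `__restrict__` helps.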

Other forum members will have to point you to the state of the art in reductions.
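That said, a minimal sketch of the warp-communication idea is possible on Volta and newer with `__match_any_sync` (assuming, as in the question, that the key is the output index, that the warp is fully active, and that `out` is initialized to FLT_MAX):

```cuda
// CUDA has no native float atomicMin, so emulate one with a CAS loop.
__device__ void atomicMinFloat(float* addr, float value)
{
    int* addr_as_int = (int*)addr;
    int old = *addr_as_int;
    while (value < __int_as_float(old)) {
        int assumed = old;
        old = atomicCAS(addr_as_int, assumed, __float_as_int(value));
        if (old == assumed) break;
    }
}

// Each lane holds one (key, val) pair. Requires sm_70+ for __match_any_sync.
__device__ void keyedWarpMin(int key, float val, float* out)
{
    // Bitmask of the lanes in this warp that hold the same key as this lane.
    unsigned peers  = __match_any_sync(0xffffffff, key);
    int      leader = __ffs(peers) - 1;   // lowest lane of the group

    // Min-reduce across the peer group via shuffles. All lanes of a group
    // share the same `peers` mask, so they execute identical iterations.
    float vmin = val;
    for (unsigned rest = peers; rest; rest &= rest - 1) {
        int src = __ffs(rest) - 1;
        vmin = fminf(vmin, __shfl_sync(peers, val, src));
    }

    // One atomic per distinct key per warp instead of one per thread.
    if ((threadIdx.x & 31) == leader)
        atomicMinFloat(&out[key], vmin);
}
```

This mainly helps when many lanes of a warp share a key; with 32 distinct keys per warp it degenerates to one atomic per thread plus the matching overhead.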