High shared memory usage but low l1tex__data_bank_reads

I have a kernel with a high l1tex__lsuin_requests.avg.pct_of_peak_sustained_elapsed.

I think the lsuin requests are mostly for shared memory; both the memory chart and the memory table confirm this.


I also wrote this kernel, so it makes sense to me that it's heavy on shared memory reads. What I don't understand is that it has a very small l1tex__data_bank_reads.avg.pct_of_peak_sustained_elapsed. To be honest, I don't really understand what this metric measures.

I searched the profiler documentation for data_bank_reads and found nothing (Search — Nsight Compute 12.6 documentation (nvidia.com)). I also watched the two GTC videos here, but they didn't mention this specific metric.

Greg also mentioned this metric in another post, but he didn't give its definition.

So I guess my question is: what does l1tex__data_bank_reads measure, and how is it different from l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum? If l1tex__data_bank_reads is the sum of shared memory reads from the 32 data banks plus the L1 cache reads from the 32 data banks, why is it so low when this kernel is bottlenecked on shared memory reads?

Are you mostly reading directly from global memory into shared memory (LDGSTS)?

Yeah, LDGSTS is used and the global reads are pretty heavy as well, but I don't think that's the bottleneck, given that L1TEX's lsuin reaches its maximum throughput first.
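(For context: LDGSTS is the SASS instruction the compiler emits for the asynchronous global-to-shared copy intrinsics on sm_80+, which move data into shared memory without staging it through registers. A minimal sketch, assuming CUDA 11+ and the `cuda_pipeline.h` intrinsics; the kernel name and sizes are illustrative, not from the original kernel:)

```cuda
#include <cuda_pipeline.h>
#include <cstdint>

__global__ void stage_tile(const uint32_t* __restrict__ g) {
    __shared__ uint32_t tile[1024];
    // Each thread copies one 4-byte word; on sm_80+ this lowers to LDGSTS,
    // bypassing the register file on the way into shared memory.
    __pipeline_memcpy_async(&tile[threadIdx.x], &g[threadIdx.x], sizeof(uint32_t));
    __pipeline_commit();
    __pipeline_wait_prior(0);
    __syncthreads();
    // ... compute on tile[] ...
}
```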

l1tex__data_bank_reads.avg is the number of data banks (0-32) read during global, local, texture, surface, or shared reads.

// CASE 1: Every thread accesses a consecutive 32-bit word
__shared__ uint32_t data[1024];
uint32_t value = data[threadIdx.x];  // assume all threads are active and predicated on

// - l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld = 1 per warp
// - l1tex__data_bank_reads = 32 per warp

// CASE 2: Every thread accesses the same data
__shared__ uint32_t data[1024];
uint32_t value = data[0];  // assume all threads are active and predicated on

// - l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld = 1 per warp
// - l1tex__data_bank_reads = 1 per warp

// CASE 3: Every thread accesses bank 0 on a different line, resulting in 32 wavefronts
__shared__ uint32_t data[1024];
uint32_t value = data[threadIdx.x * 32];  // assume all threads are active and predicated on

// - l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld = 32 per warp
// - l1tex__data_bank_reads = 32 per warp (1 per wavefront)

On some GPUs more than 32 banks can be accessed per cycle if multiple clients are accessing the SRAM (e.g. global/local/texture/surface and shared).


Thanks @Greg for the detailed explanation! This makes a lot of sense
