I assume there is no bank conflict in this case, because the data at location 0 will be broadcast to threads 0, 4, 8, 12, 16, 20, 24, and 28.
But the profiling results show that 4 requests are generated for each LDS.U.128 instruction. (I expected 2 requests.)
On CC 7.0 - 7.5 devices, shared memory loads with uniform addresses can increase bandwidth if either of the following is true:

- Thread pairs (Tn and Tn^1) have the same address for all active threads (i.e. T0==T1, T2==T3, T4==T5, T6==T7, etc.), or
- Thread pairs (Tn and Tn^2) have the same address for all active threads (i.e. T0==T2, T1==T3, T4==T6, T5==T7, etc.)

(Note that both rules encompass the case where all active threads have the same address.)
The return bandwidth of L1 and shared memory is 128 bytes/cycle == 1 register/cycle. A uniform access can achieve 2x the performance by packing two return registers into a single 128-byte return. In the case above, neither rule is met, so LDS.128 requires 4 requests per instruction, as it returns 4 registers (16 bytes x 32 threads = 512 bytes = 4 x 128-byte returns). If the guidelines are met, the requests per instruction should drop from 4 to 2.
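To make the rule concrete, here is a small Python sketch (not an NVIDIA tool, just an illustrative model under the assumptions stated in the comments) that takes the per-thread shared memory addresses of one warp and predicts whether an LDS.128 needs 4 requests or can pack its returns into 2:

```python
def lds128_requests(addrs):
    """Predict requests per LDS.128 for one warp of 32 threads.

    addrs: list of 32 per-thread shared memory byte addresses
    (assumed 16-byte aligned, all 32 threads active).

    Each thread returns 16 bytes, so a warp returns 512 bytes.
    At 128 bytes/cycle that is normally 4 requests. If thread
    pairs Tn/Tn^1 (or Tn/Tn^2) share an address, two return
    registers can be packed per 128-byte return, halving the
    request count to 2.
    """
    assert len(addrs) == 32
    pair_xor1 = all(addrs[t] == addrs[t ^ 1] for t in range(32))
    pair_xor2 = all(addrs[t] == addrs[t ^ 2] for t in range(32))
    return 2 if (pair_xor1 or pair_xor2) else 4


# Case from the question: threads 0, 4, 8, ... read location 0,
# i.e. address repeats with period 4 across the warp. T0 != T1
# and T0 != T2, so neither rule holds: 4 requests.
broadcast_mod4 = [(t % 4) * 16 for t in range(32)]
print(lds128_requests(broadcast_mod4))   # -> 4

# Fully uniform warp: every thread reads address 0. Both rules
# hold, so the packed path applies: 2 requests.
uniform = [0] * 32
print(lds128_requests(uniform))          # -> 2
```

This also shows why "no bank conflict" and "minimum requests" are different questions: the broadcast pattern above is conflict-free, but it still fails the pairwise-uniform packing rule, so the profiler reports 4 requests per instruction.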