Unexpected shared memory bank conflict.

Greg · July 1, 2019, 2:59pm

On CC 7.0 - 7.5 devices, shared memory loads with uniform addresses can achieve higher bandwidth if either of the following is true:

Thread pairs (Tn and Tn^1) have the same addresses for all active threads (i.e. T0==T1, T2==T3, T4==T5, T6==T7, etc.), or
Thread pairs (Tn and Tn^2) have the same addresses for all active threads (i.e. T0==T2, T1==T3, T4==T6, T5==T7, etc.)
(Note these encompass the case where all active threads have the same address)
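The two pairing rules above can be checked on the host for a given set of per-lane addresses. This is only a sketch: `pairs_match` and `qualifies_for_uniform_packing` are hypothetical helper names, not part of any CUDA API, and the treatment of inactive lanes (a pair is compared only when both lanes are active) is an assumption, since the post does not spell that case out.

```cpp
#include <array>
#include <cstdint>

constexpr int kWarpSize = 32;

// addr[i] is the shared memory byte address requested by lane i.
// active_mask has bit i set if lane i is active.
// xor_bit is 1 for the (Tn, Tn^1) rule and 2 for the (Tn, Tn^2) rule.
// Assumption: a pair is only compared when both of its lanes are active.
bool pairs_match(const std::array<std::uint32_t, kWarpSize>& addr,
                 std::uint32_t active_mask, int xor_bit) {
    for (int lane = 0; lane < kWarpSize; ++lane) {
        const int partner = lane ^ xor_bit;
        const bool lane_active = ((active_mask >> lane) & 1u) != 0;
        const bool partner_active = ((active_mask >> partner) & 1u) != 0;
        if (lane_active && partner_active && addr[lane] != addr[partner]) {
            return false;
        }
    }
    return true;
}

// True if either pairing rule holds for the whole warp.
bool qualifies_for_uniform_packing(
        const std::array<std::uint32_t, kWarpSize>& addr,
        std::uint32_t active_mask) {
    return pairs_match(addr, active_mask, 1) ||
           pairs_match(addr, active_mask, 2);
}
```

For example, an address pattern of `(lane / 2) * 16` satisfies the first rule, while `lane * 16` (every lane reading a distinct 16-byte element) satisfies neither.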

The return bandwidth of L1 and shared memory is 128 bytes/cycle, i.e. one 32-bit register per thread per cycle for a full warp (32 threads x 4 bytes). A uniform access can achieve 2x the performance by packing the return registers. In the case above the rules are not met, so LDS.128 requires 4 requests per instruction, as it returns 4 registers. If the guidelines are met, the requests per instruction should drop from 4 to 2.
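The arithmetic behind the 4 and 2 figures can be written out as a small model. This is a sketch of the numbers quoted in the post, not a documented hardware formula: `requests_per_instruction` is a hypothetical name, and the model assumes a full warp, no additional bank conflicts, and that packing never reduces the count below one request.

```cpp
constexpr int kWarpSize = 32;
constexpr int kReturnBytesPerCycle = 128;  // return bandwidth stated in the post

// bytes_per_thread: 4 for LDS.32, 8 for LDS.64, 16 for LDS.128.
// uniform_pairs: true if one of the pairing rules is met.
// Assumption: one request per 128 bytes returned, halved (rounded up)
// when the return registers can be packed.
constexpr int requests_per_instruction(int bytes_per_thread, bool uniform_pairs) {
    const int unpacked =
        (bytes_per_thread * kWarpSize + kReturnBytesPerCycle - 1) / kReturnBytesPerCycle;
    return uniform_pairs ? (unpacked + 1) / 2 : unpacked;
}

// LDS.128 returns 16 bytes x 32 threads = 512 bytes, i.e. 4 registers per thread.
static_assert(requests_per_instruction(16, false) == 4, "rules not met");
static_assert(requests_per_instruction(16, true) == 2, "rules met");
```

Under these assumptions the model reproduces the two cases described above: 4 requests per LDS.128 when the rules are not met, and 2 when they are.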
