The limitations can be directly inferred from the statements already provided. If, for a single warp-issued instruction, two (or more) threads in a warp access different 32-bit locations in the same bank, that results in a bank conflict, regardless of the access width (8-bit, 16-bit, or 32-bit).
However, there is the broadcast rule. Two or more threads accessing the same location do not generate a bank conflict. This is also true regardless of the access width (8-bit, 16-bit, or 32-bit), and it even holds if, for example, different 8-bit locations within the same 32-bit word are being accessed by different threads.
Coupled with that, as already stated, the maximum bandwidth of shared memory (I believe, post-Kepler) is one 32-bit quantity/location per bank, per cycle (per SM). Accessing 16-bit or 8-bit quantities per thread will necessarily reduce the maximum achievable bandwidth to one-half (16-bit) or one-quarter (8-bit) of that peak.
There are also other chip-specific factors which may impact whether the full bandwidth of 128 bytes per access (32 threads in a warp times 32 bits per thread) can be achieved.
I’m not sure what you mean. 8-bit or 16-bit access is supported in a fashion similar to 32-bit, as already covered in my comments above. At some point, I cannot proceed any further with questions of the form “why is it this way?” I have given a behavioral description. Beyond that, I will eventually end up at the answer “because that is the way the GPU designers chose to design it”.