I’m trying to understand the exact timing of shared memory bank access when a single thread performs a FLOAT4 operation.
Specific Question: When a thread executes:
```cuda
FLOAT4(s_a[0][0]) = FLOAT4(data);
```
Are banks 0, 1, 2, and 3 occupied simultaneously within the same clock cycle, or are they accessed sequentially across multiple cycles?
Context: I’m analyzing bank conflicts in SGEMM kernels where threads use FLOAT4 to load data into shared memory. Understanding whether the 4 banks are truly accessed in parallel affects how I interpret bank conflict patterns.
My Hypothesis: I suspect that these 4 banks are NOT occupied simultaneously. Instead, they might be accessed sequentially (e.g., bank 0 in cycle 1, bank 1 in cycle 2, etc.). This would explain why we see systematic bank conflicts in regular access patterns.
For example, in a typical SGEMM pattern:
t0 accesses banks 0-3
t8 also tries to access banks 0-3, creating conflicts
If banks aren’t occupied simultaneously, then bank-conflict-free (BCF) optimizations would work by staggering access patterns: t8 could access bank 1 while t0 accesses bank 0, creating an interleaved pattern that reduces conflicts.
GPU Architecture: A100 (if relevant to the answer)
=====
Any clarification on the actual hardware behavior would be helpful. References to official documentation would also be appreciated.
I was initially confused about why bank conflict optimization would be beneficial if all 32 banks are already being utilized. From a pure throughput perspective, if all banks are occupied, wouldn’t that mean we’re already achieving maximum shared memory bandwidth? Even with conflicts, the total data throughput per transaction should remain the same since all banks are still being used.
Positing both that all banks are utilized (within a single transaction) and that there are bank conflicts doesn’t really make sense to me. For float4 access, if all banks are being utilized, that implies the first 8 threads in the warp are each requesting a separate float4 mapped to separate banks (for example, contiguous access across threads, although that is not the only suitable pattern). In that case I don’t think it is possible to have bank conflicts.
Yes, that scenario exists; that’s exactly what happens. However, shared memory can only serve 128 bytes per cycle, so there is no loss of efficiency.
A warp-wide, un-bank-conflicted float4 access will require 4 cycles at shared memory, each cycle delivering 128 bytes.
I’m not sure what bank-conflicted access pattern you would have in mind, but I doubt adding bank conflicts in would be as efficient.
Let’s consider two cases:
Case 1: float4 access warp-wide. Threads 0-7 access banks 0-31, threads 8-15 access banks 0-31, threads 16-23 access banks 0-31, and threads 24-31 access banks 0-31. Viewed warp-wide, this looks like columnar access, which would normally suggest bank conflicts. There are no bank conflicts, however, because bank conflicts apply only to the threads within a single transaction, not warp-wide in this case. The entire servicing process for this request, considered warp-wide, takes 4 cycles.
Case 2: float4 access warp-wide. Threads 0-3 access banks 0-15, and threads 4-7 access banks 0-15. Threads 8-11 access banks 16-31, and threads 12-15 access banks 16-31. Threads 16-31 repeat the pattern of threads 0-15. In this case the GPU still breaks the request into 4 transactions of 8 threads each, but now the transaction associated with threads 0-7 is two-way bank-conflicted, and likewise for the other 3 transactions. This entire request, warp-wide, takes 8 cycles to complete rather than 4.
Note: Modern profiler-speak refers to the above thing as a “wavefront” instead of a transaction.