I’m trying to understand the exact timing of shared memory bank access when a single thread performs a FLOAT4 operation.
Specific Question: When a thread executes:
```cuda
FLOAT4(s_a[0][0]) = FLOAT4(data);
```
Are banks 0, 1, 2, and 3 occupied simultaneously within the same clock cycle, or are they accessed sequentially across multiple cycles?
Context: I’m analyzing bank conflicts in SGEMM kernels where threads use FLOAT4 to load data into shared memory. Understanding whether the 4 banks are truly accessed in parallel affects how I interpret bank conflict patterns.
My Hypothesis: I suspect that these 4 banks are NOT occupied simultaneously. Instead, they might be accessed sequentially (e.g., bank 0 in cycle 1, bank 1 in cycle 2, etc.). This would explain why we see systematic bank conflicts in regular access patterns.
For example, in a typical SGEMM pattern:
t0 accesses banks 0-3
t8 also tries to access banks 0-3, creating conflicts
If banks aren’t occupied simultaneously, then bank-conflict-free (BCF) optimizations would work by staggering access patterns: t8 could access bank 1 while t0 accesses bank 0, creating an interleaved pattern that reduces conflicts.
GPU Architecture: A100 (if relevant to the answer)
=====
Any clarification on the actual hardware behavior would be helpful. References to official documentation would also be appreciated.
I was initially confused about why bank conflict optimization would be beneficial if all 32 banks are already being utilized. From a pure throughput perspective, if all banks are occupied, wouldn’t that mean we’re already achieving maximum shared memory bandwidth? Even with conflicts, the total data throughput per transaction should remain the same since all banks are still being used.
Positing both that all banks are utilized (within a single transaction) and that there are bank conflicts doesn’t really make sense to me. For float4 access, if all banks are being utilized, that implies the first 8 threads in the warp are each requesting a separate float4 mapped to separate banks (for example, contiguous access across threads, although that is not the only suitable pattern). In that case I don’t think it is possible to have bank conflicts.
Yes, that scenario exists; that’s exactly what happens. However, shared memory can only serve 128 bytes per cycle, so there is no loss of efficiency.
A warp-wide, un-bank-conflicted float4 access will require 4 cycles at shared memory, each cycle delivering 128 bytes.
I’m not sure what bank-conflicted access pattern you would have in mind, but I doubt adding bank conflicts in would be as efficient.
Let’s consider two cases:
Case 1: float4 access warp-wide. Threads 0-7 access banks 0-31, threads 8-15 access banks 0-31, threads 16-23 access banks 0-31, and threads 24-31 access banks 0-31. Viewed warp-wide, this looks like columnar access, which would normally suggest bank conflicts. There are no bank conflicts, however, because bank conflicts apply only to the threads within a single transaction, not warp-wide in this case. The entire servicing process for this request, considered warp-wide, takes 4 cycles.
Case 2: float4 access warp-wide. Threads 0-3 access banks 0-15, and threads 4-7 access banks 0-15. Threads 8-11 access banks 16-31, and threads 12-15 access banks 16-31. Threads 16-31 repeat the pattern of threads 0-15. In this case the GPU still breaks the request into 4 transactions of 8 threads each, but now the transaction associated with threads 0-7 is two-way bank-conflicted, and likewise for the other 3 transactions. This entire request, warp-wide, takes 8 cycles to complete rather than 4.
Note: Modern profiler-speak refers to the above thing as a “wavefront” instead of a transaction.