Shared memory: Optimizing vectorized accesses vs bank conflicts

Hi everyone,

I have been struggling with the following problem for quite a while now. Basically, I have a struct of two floats (complex numbers), and I am trying to figure out the best memory access pattern when loading those numbers from shared memory.

struct __align__(2*sizeof(float)) complex {
  float real, imag; // 8-byte alignment lets the compiler use v2 loads/stores
};

The following questions emerged when I thought about it:

  1. Memory bank conflicts happen within one warp, but do they also happen between threads that are in the same block but not in the same warp? E.g., would array[threadIdx.x]++ lead to a 4-way conflict if the block had 128 threads and array was declared as __shared__ float array[], since the scheduler could in principle run all 128 threads in parallel?
  2. If I now have a shared array of the aforementioned complex numbers and I want to perform some operation on them, e.g., multiplying by some complex constant z, I could go two ways:
    1. Optimize for bank conflicts: just do array[threadIdx.x] *= z. The PTX contains {ld, st}.shared.v2 instructions. I therefore get a 2-way bank conflict within one warp, but also v2 vectorization.
    2. Optimize for vectorization: do something like array[threadIdx.x*2] *= z and array[threadIdx.x*2 + 1] *= z. This leads to v4 vectorization but also a 4-way bank conflict (or does it?). See the sketch after this list.
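
For concreteness, here is a minimal sketch of the two variants; variant_a, variant_b, and the cmul helper are illustrative names I made up, since the struct has no operator*=:

__device__ complex cmul(complex a, complex b) {
    return complex{a.real * b.real - a.imag * b.imag,
                   a.real * b.imag + a.imag * b.real};
}

__global__ void variant_a(complex z) {
    __shared__ complex array[128];
    // (1) one complex per thread: ld/st.shared.v2, 2-way bank conflict
    array[threadIdx.x] = cmul(array[threadIdx.x], z);
}

__global__ void variant_b(complex z) {
    __shared__ complex array[256];
    // (2) two adjacent complex per thread: hoping for ld/st.shared.v4
    array[threadIdx.x * 2]     = cmul(array[threadIdx.x * 2], z);
    array[threadIdx.x * 2 + 1] = cmul(array[threadIdx.x * 2 + 1], z);
}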

I am very unsure how vectorized shared memory instructions interact with memory bank conflicts. I would be very grateful for an answer or any pointer to material on this topic.

Best,
Tobias

Access to shared memory from the 4 SM partitions (SMSPs), and between different warps in general, is serialized. You will never get bank conflicts (nor any advantage from coordinated access patterns) from that source.

When doing vectorized accesses (v2, v4), two-way and four-way bank conflicts are fully acceptable, as the transaction takes correspondingly longer anyway.

For the v4 case, you really have to check that the assembler performs the v4 optimization by itself by combining the two accesses. Otherwise you can define a complex2 type for the memory accesses, e.g. complex2 c2 = array.ascomplex2[threadIdx.x], and do the multiplication component-wise on c2.first and c2.second. The array could be a union type offering different access sizes.
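
A minimal sketch of that union idea; complex2, shared_view, and ascomplex2 are illustrative names, not existing types, and cmul is the helper sketched in the question:

struct __align__(4 * sizeof(float)) complex2 {
    complex first, second; // two adjacent complex values, 16 bytes
};

union shared_view {
    complex  ascomplex[256];   // element-wise (8-byte) view
    complex2 ascomplex2[128];  // paired (16-byte) view -> v4 accesses
};

__global__ void multiply_paired(complex z) {
    __shared__ shared_view array;
    // one 16-byte load (ld.shared.v4), component-wise multiply, one store
    complex2 c2 = array.ascomplex2[threadIdx.x];
    c2.first  = cmul(c2.first,  z);
    c2.second = cmul(c2.second, z);
    array.ascomplex2[threadIdx.x] = c2;
}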

BTW typically you define __shared__ memory as volatile. Synchronization instructions are not enough, are they?

Thank you for your answer! That clears up a lot. I was not quite aware of the concept of transactions. For anyone wondering, I found out more about this in the Memory Transactions article.

I found that the compiler does a really good job of finding vectorized accesses without them being explicit in the code, though more descriptive code might not be a bad thing. I really like the .asType idea; I will definitely adopt it.

Regarding volatile, I never had any problems with shared memory without it.

Thank you for sharing the link to the article about Memory Transactions.

The ‘danger’ of not using volatile is that the compiler can keep accesses cached in local registers or skip them entirely. As long as each shared memory index is used by only one dedicated thread (e.g. for short-term manual data caching) and no other thread reads or writes there, it is okay.

If you want to share data, the compiler+assembler could decide that accesses are not needed, because it is enough to store data in registers.

__shared__ int sharedmem; // without volatile
sharedmem = 0;
__syncwarp(); // we sync the writes to shared memory
if (threadIdx.x == 0)
    sharedmem = 2;
__syncwarp(); // we sync the writes to shared memory
if (threadIdx.x != 0) {
    int read = sharedmem;
    f(read); // f is some function consuming the value
} else {
    int read = sharedmem;
    f(read);
}

The compiler perhaps knows that threadIdx.x does not change within a thread.
It can deduce that in the case of threadIdx.x != 0 the memory is always 0. So it decides not to write the memory for this branch and not to read it back. In the case of threadIdx.x == 0 the memory is always 2, so this also does not have to be written or read back.

All accesses to shared memory would be removed.

A programmer would expect all threads to read 2, but that does not happen for threads 1…31, only for thread 0.

The volatile ensures that reads and writes are actually always performed and not cached locally. This non-volatile analysis by the compiler does not consider the other threads; it is independent of the synchronization instructions.

So in nearly all cases (except when the memory locations are truly separated per thread), you should use volatile for shared memory.
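
For reference, a volatile variant of the snippet above (with the same hypothetical f), which forces the stores and the reads to actually go through shared memory:

__shared__ volatile int sharedmem; // volatile: accesses cannot be elided
sharedmem = 0;
__syncwarp(); // we sync the writes to shared memory
if (threadIdx.x == 0)
    sharedmem = 2;
__syncwarp(); // we sync the writes to shared memory
int read = sharedmem; // every thread performs a real load and reads 2
f(read);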

Using volatile, on the other hand, can worsen vectorization.

For absolute full control you can always call inline functions or PTX asm blocks.
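
As an example, here is a minimal sketch of such a PTX block: a v2 shared-memory load wrapped in an inline function. ld_shared_v2 is an illustrative name; __cvta_generic_to_shared converts a generic pointer to a shared-state-space address:

__device__ __forceinline__ float2 ld_shared_v2(const float2* p) {
    float2 v;
    unsigned long long addr = __cvta_generic_to_shared(p); // generic -> shared address
    asm volatile("ld.shared.v2.f32 {%0, %1}, [%2];"
                 : "=f"(v.x), "=f"(v.y)
                 : "l"(addr));
    return v;
}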

