Shared memory: Optimizing vectorized accesses vs bank conflicts

Hi everyone,

I have been struggling with the following problem for quite a while now. Basically, I have a struct of two floats (complex numbers), and I am trying to figure out the best memory access pattern when loading those numbers from shared memory.

struct __align__(2*sizeof(float)) complex {
  float real, imag; // 8-byte alignment lets the compiler use v2 loads/stores
};

The following questions emerged when I thought about it:

  1. Memory bank conflicts happen within one warp, but do they also happen between threads that are in the same block but not in the same warp? E.g., would array[threadIdx.x]++ lead to a 4-way conflict if the block had 128 threads and array was declared as __shared__ float array[], since the scheduler could in principle run all 128 threads in parallel?
  2. If I now have a shared array of the aforementioned complex numbers and I want to perform some operation on them, e.g., multiplying by some complex constant z, I could go two ways:
    1. Optimize for bank conflicts: just do array[threadIdx.x] *= z. The PTX contains {ld, st}.shared.v2 instructions. I therefore get a 2-way bank conflict within one warp, but also v2 vectorization.
    2. Optimize for vectorization: do something like array[threadIdx.x*2] *= z and array[threadIdx.x*2 + 1] *= z. This leads to v4 vectorization but also a 4-way bank conflict (or does it?). See the sketch after this list.
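
For concreteness, here is a minimal sketch of the two variants; variant_a, variant_b, and the cmul helper are illustrative names I made up, since the struct has no operator*=:

__device__ complex cmul(complex a, complex b) {
    return complex{a.real * b.real - a.imag * b.imag,
                   a.real * b.imag + a.imag * b.real};
}

__global__ void variant_a(complex z) {
    __shared__ complex array[128];
    // (1) one complex per thread: ld/st.shared.v2, 2-way bank conflict
    array[threadIdx.x] = cmul(array[threadIdx.x], z);
}

__global__ void variant_b(complex z) {
    __shared__ complex array[256];
    // (2) two adjacent complex per thread: hoping for ld/st.shared.v4
    array[threadIdx.x * 2]     = cmul(array[threadIdx.x * 2], z);
    array[threadIdx.x * 2 + 1] = cmul(array[threadIdx.x * 2 + 1], z);
}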

I am very unsure how vectorized shared memory instructions interact with memory bank conflicts. I would be very grateful for an answer or any pointer to material on this topic.

Best,
Tobias

Access to shared memory from the 4 SM partitions (SMSPs), and between different warps in general, is serialized. You will never get bank conflicts (nor any advantage from coordinated access patterns) from that source.

When doing vectorized accesses (v2, v4), two-way and four-way bank conflicts are fully acceptable, as the transaction takes correspondingly longer anyway.

For the v4 case, you really have to check that the assembler performs the v4 optimization by itself by combining the two accesses. Otherwise you can define a complex2 type for the memory accesses, e.g. complex2 c2 = array.ascomplex2[threadIdx.x], and do the multiplication component-wise on c2.first and c2.second. The array could be a union type offering different access sizes.
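
A minimal sketch of that union idea; complex2, shared_view, and ascomplex2 are illustrative names, not existing types, and cmul is the helper sketched in the question:

struct __align__(4 * sizeof(float)) complex2 {
    complex first, second; // two adjacent complex values, 16 bytes
};

union shared_view {
    complex  ascomplex[256];   // element-wise (8-byte) view
    complex2 ascomplex2[128];  // paired (16-byte) view -> v4 accesses
};

__global__ void multiply_paired(complex z) {
    __shared__ shared_view array;
    // one 16-byte load (ld.shared.v4), component-wise multiply, one store
    complex2 c2 = array.ascomplex2[threadIdx.x];
    c2.first  = cmul(c2.first,  z);
    c2.second = cmul(c2.second, z);
    array.ascomplex2[threadIdx.x] = c2;
}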

BTW typically you define __shared__ memory as volatile. Synchronization instructions are not enough, are they?

Thank you for your answer! That clears up a lot. I was not quite aware of the concept of transactions. For anyone wondering, I found out more about this in the Memory Transactions article.

I found that the compiler does a really good job of finding vectorized accesses without them being explicit in the code, though more descriptive code might not be a bad thing. I really like the .asType idea; I will definitely adopt it.

Regarding volatile, I never had any problems with shared memory without it.

Thank you for sharing the link to the article about Memory Transactions.

The ‘danger’ of not using volatile is that the compiler can keep accesses cached in local registers or skip them entirely. As long as each shared memory index is used by only one dedicated thread (e.g. for short-term manual data caching) and no other thread reads or writes there, it is okay.

If you want to share data, the compiler+assembler could decide that accesses are not needed, because it is enough to store data in registers.

__shared__ int sharedmem; // without volatile
sharedmem = 0;
__syncwarp(); // we sync the writes to shared memory
if (threadIdx.x == 0)
    sharedmem = 2;
__syncwarp(); // we sync the writes to shared memory
if (threadIdx.x != 0) {
    int read = sharedmem;
    f(read); // f is some function consuming the value
} else {
    int read = sharedmem;
    f(read);
}

The compiler perhaps knows that threadIdx.x does not change within a thread.
It can deduce that in the case of threadIdx.x != 0 the memory is always 0. So it decides not to write the memory for this branch and not to read it back. In the case of threadIdx.x == 0 the memory is always 2, so this also does not have to be written or read back.

All accesses to shared memory would be removed.

A programmer would expect all threads to read 2, but that does not happen for threads 1…31, only for thread 0.

The volatile ensures that reads and writes are actually always performed and not cached locally. This non-volatile analysis by the compiler does not consider the other threads; it is independent of the synchronization instructions.

So in nearly all cases (except when the memory locations are truly separated per thread), you should use volatile for shared memory.
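
For reference, a volatile variant of the snippet above (with the same hypothetical f), which forces the stores and the reads to actually go through shared memory:

__shared__ volatile int sharedmem; // volatile: accesses cannot be elided
sharedmem = 0;
__syncwarp(); // we sync the writes to shared memory
if (threadIdx.x == 0)
    sharedmem = 2;
__syncwarp(); // we sync the writes to shared memory
int read = sharedmem; // every thread performs a real load and reads 2
f(read);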

Using volatile, on the other hand, can worsen vectorization.

For absolute full control you can always call inline functions or PTX asm blocks.
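
As an example, here is a minimal sketch of such a PTX block: a v2 shared-memory load wrapped in an inline function. ld_shared_v2 is an illustrative name; __cvta_generic_to_shared converts a generic pointer to a shared-state-space address:

__device__ __forceinline__ float2 ld_shared_v2(const float2* p) {
    float2 v;
    unsigned long long addr = __cvta_generic_to_shared(p); // generic -> shared address
    asm volatile("ld.shared.v2.f32 {%0, %1}, [%2];"
                 : "=f"(v.x), "=f"(v.y)
                 : "l"(addr));
    return v;
}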

