Hi everyone,
I have been struggling with the following problem for quite a while now. Basically, I have a struct of two floats (a complex number) and I am trying to figure out the best memory access pattern when loading those numbers from shared memory.
struct __align__(2*sizeof(float)) complex {
float real, imag;
};
The following questions emerged, when I thought about it:
- Memory bank conflicts happen within one warp, but do they also happen between threads that are in the same block but not in the same warp? E.g., would
array[threadIdx.x]++
lead to a 4-way conflict if the block had 128 threads and array was of type __shared__ float array[], since the scheduler could in principle run all 128 threads in parallel?
- If I now have a shared array of the aforementioned complex numbers and I want to perform some operations on them, e.g., multiplying by some complex constant z, I could go two ways:
1. Optimize for memory bank conflicts: just do array[threadIdx.x] *= z. The resulting PTX contains {ld, st}.shared.v2 instructions. I therefore get a 2-way bank conflict within one warp, but also v2 vectorization.
2. Optimize for vectorization: do something like array[threadIdx.x*2] *= z and array[threadIdx.x*2 + 1] *= z, i.e., each thread handles two adjacent complex numbers. This leads to v4 vectorization but also a 4-way bank conflict (or does it?).
I am very unsure about how vectorized shared memory instructions interact with memory bank conflicts. I would be very grateful for an answer or any pointer to material on this topic.
Best,
Tobias