In the kernel you posted, global memory is going to be the bottleneck, so a few piddly bank conflicts will not noticeably hurt your performance.
Why do you even need shared memory in this kernel? You aren’t sharing values between threads in a block, so just dump in[index] into a local float2 variable.
I see. I asked about this kernel because I have the same kind of problem with another one, a float2 array transpose.
I wanted to take the SDK transpose example and modify it to transpose a float2 array, but I get a lot of bank conflicts. How could I avoid them?
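For reference, here is a minimal sketch of one common way to avoid the conflicts in a tiled float2 transpose. It is not the SDK code. The names transposeFloat2, TILE_DIM, tileX, tileY and transposeMatchesReference are all made up for illustration. The idea, assuming shared memory banks are 32 bits wide, is to keep the .x and .y components in two separate float tiles, so every shared memory access is a single 32-bit word, and to pad each tile row by one element so that column reads land in different banks.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE_DIM 16  // assumed tile size; launch with TILE_DIM x TILE_DIM threads

// Transposes `in` (height rows x width columns) into `out` (width rows x height columns).
__global__ void transposeFloat2(float2 *out, const float2 *in, int width, int height)
{
    // Separate tiles for .x and .y; the +1 padding staggers the rows across banks.
    __shared__ float tileX[TILE_DIM][TILE_DIM + 1];
    __shared__ float tileY[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    if (x < width && y < height) {
        float2 v = in[y * width + x];  // both components are used below
        tileX[threadIdx.y][threadIdx.x] = v.x;
        tileY[threadIdx.y][threadIdx.x] = v.y;
    }
    __syncthreads();

    // Swap the block indices so that consecutive threads write consecutive addresses.
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    if (x < height && y < width) {
        out[y * height + x] = make_float2(tileX[threadIdx.x][threadIdx.y],
                                          tileY[threadIdx.x][threadIdx.y]);
    }
}

// Hypothetical host-side check: runs the kernel and compares with a CPU transpose.
bool transposeMatchesReference(int width, int height)
{
    size_t n = (size_t)width * height;
    std::vector<float2> hIn(n), hOut(n);
    for (size_t i = 0; i < n; ++i) hIn[i] = make_float2((float)i, -(float)i);

    float2 *dIn = 0, *dOut = 0;
    if (cudaMalloc(&dIn, n * sizeof(float2)) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, n * sizeof(float2)) != cudaSuccess) { cudaFree(dIn); return false; }
    cudaMemcpy(dIn, hIn.data(), n * sizeof(float2), cudaMemcpyHostToDevice);

    dim3 block(TILE_DIM, TILE_DIM);
    dim3 grid((width + TILE_DIM - 1) / TILE_DIM, (height + TILE_DIM - 1) / TILE_DIM);
    transposeFloat2<<<grid, block>>>(dOut, dIn, width, height);
    cudaError_t err = cudaMemcpy(hOut.data(), dOut, n * sizeof(float2), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return false;

    for (int r = 0; r < height; ++r)
        for (int c = 0; c < width; ++c) {
            float2 a = hIn[(size_t)r * width + c];
            float2 b = hOut[(size_t)c * height + r];
            if (a.x != b.x || a.y != b.y) return false;
        }
    return true;
}
```

Whether the split tiles are actually needed depends on the hardware. The profiler's bank conflict counter is the thing to check before and after the change.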
That’s what I thought. But when I just use a local variable, like this:
float2 a;
a = in[index];
out[index] = a.x;
I get uncoalesced loads and bad performance. Does nvcc “optimize” this by removing the “useless” variable?
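A quick way to test that guess is sketched below. It assumes the compiler drops the load of the unused a.y, which would turn the access into a 32-bit read with an 8-byte stride between threads. The kernel uses both components, so the whole float2 has to be loaded. The names splitFloat2, outX, outY and splitMatchesInput are hypothetical, and the second output array is only there to keep a.y alive.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Reads one float2 per thread and writes both components, so neither load can be dropped.
__global__ void splitFloat2(float *outX, float *outY, const float2 *in, int n)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n) {
        float2 a = in[index];
        outX[index] = a.x;
        outY[index] = a.y;  // keeps a.y in use
    }
}

// Hypothetical host-side check that the kernel copies both components correctly.
bool splitMatchesInput(int n)
{
    std::vector<float2> hIn(n);
    std::vector<float> hX(n), hY(n);
    for (int i = 0; i < n; ++i) hIn[i] = make_float2((float)i, (float)(2 * i));

    float2 *dIn = 0;
    float *dX = 0, *dY = 0;
    if (cudaMalloc(&dIn, n * sizeof(float2)) != cudaSuccess) return false;
    if (cudaMalloc(&dX, n * sizeof(float)) != cudaSuccess) { cudaFree(dIn); return false; }
    if (cudaMalloc(&dY, n * sizeof(float)) != cudaSuccess) { cudaFree(dIn); cudaFree(dX); return false; }
    cudaMemcpy(dIn, hIn.data(), n * sizeof(float2), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    splitFloat2<<<blocks, threads>>>(dX, dY, dIn, n);
    cudaError_t e1 = cudaMemcpy(hX.data(), dX, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaError_t e2 = cudaMemcpy(hY.data(), dY, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dX);
    cudaFree(dY);
    if (e1 != cudaSuccess || e2 != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (hX[i] != hIn[i].x || hY[i] != hIn[i].y) return false;
    return true;
}
```

If the profiler reports coalesced loads for this version but not for the one that only writes a.x, that would support the idea that the unused component is being optimized away.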