I found that in a piece of code, if I put a particular variable in the first 32 bytes of shared memory, it’s slower than putting it at a higher address. There are other variables in shared memory as well. Any guesses as to the reason?
I’m not sure if it’s related, but I have noticed from looking at compiled kernels (with decuda) that an instruction that uses a variable at a very low shared memory address can sometimes be encoded as a “half” instruction, whereas the same instruction using a variable at a higher shared memory address cannot. I have no idea whether these “half” instructions execute any faster, though. Even if they do, that should have the opposite effect to what you’re seeing.
It seems more likely that the low shared memory address you are choosing is causing bank conflicts.
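To check the bank-conflict idea on paper, here is a rough host-side sketch that models the bank mapping. It assumes the layout of compute capability 1.x hardware (16 banks, one 32-bit word per bank, accesses issued per half-warp of 16 threads); the names `bankOf` and `conflictDegree` are made up for illustration and are not part of any CUDA API. Reads of the same word are treated as a broadcast and not counted as a conflict.

```cpp
#include <array>
#include <cstddef>
#include <set>

// Assumed hardware parameters (compute capability 1.x style layout).
constexpr int kBanks = 16;          // number of shared memory banks
constexpr int kHalfWarp = 16;       // threads serviced together
constexpr unsigned kWordBytes = 4;  // each bank is one 32-bit word wide

// Bank that a given shared memory byte offset falls into.
int bankOf(unsigned byteAddr) {
    return static_cast<int>((byteAddr / kWordBytes) % kBanks);
}

// Worst-case number of distinct words that one half-warp requests from a
// single bank when thread t accesses word (baseByte / 4 + t * strideWords).
// 1 means conflict-free; n means the access is serialized n ways.
int conflictDegree(unsigned baseByte, unsigned strideWords) {
    std::array<std::set<unsigned>, kBanks> wordsPerBank;
    for (int t = 0; t < kHalfWarp; ++t) {
        unsigned addr = baseByte + static_cast<unsigned>(t) * strideWords * kWordBytes;
        wordsPerBank[bankOf(addr)].insert(addr / kWordBytes);
    }
    std::size_t worst = 0;
    for (const auto& words : wordsPerBank) {
        if (words.size() > worst) worst = words.size();
    }
    return static_cast<int>(worst);
}
```

Under this model, `conflictDegree(0, 1)` and `conflictDegree(32, 1)` both come out conflict-free, while `conflictDegree(0, 2)` gives a 2-way conflict. So for a uniform stride, moving the base address by 32 bytes does not change the conflict degree by itself; what matters is how the variable's address lines up with whatever else the same half-warp accesses in the same instruction.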