trading memory to negate bank conflicts


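matrix transposes are one example where memory footprint is traded for avoiding bank conflicts - the shared memory tile is padded, i.e. a larger array is allocated, so that accesses which would otherwise hit the same bank are spread across different banks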
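to make the transpose case concrete, here is a rough host-side sketch (plain c++, no gpu needed) that counts how many threads of a warp land in the same bank when reading one column of a float tile. the bank model (32 banks, 4 bytes wide, 32 threads per warp) and the name column_conflict_degree are my own assumptions for illustration, not taken from any particular device or library.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Assumed model: 32 banks, each 4 bytes wide, so one float per bank slot.
constexpr int kBanks = 32;
constexpr int kWarp = 32;

// Hypothetical helper: worst-case number of threads in a warp that hit the
// same bank when thread r accesses element (r, col) of a row-major float
// tile whose rows are row_width floats long.
int column_conflict_degree(int row_width, int col) {
    assert(row_width > 0 && col >= 0 && col < row_width);
    std::array<int, kBanks> hits{};          // accesses per bank, zero-initialized
    for (int r = 0; r < kWarp; ++r) {
        int word = r * row_width + col;      // linear float index inside the tile
        ++hits[word % kBanks];               // bank that this word maps to
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

under this model, column_conflict_degree(32, 0) returns 32 (every thread hits the same bank), while column_conflict_degree(33, 0) returns 1. the price of the padding is one extra float per row, i.e. 32 floats for a 32 x 32 tile.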

if such a trade-off is generally possible, could i also trade memory footprint to avoid bank conflicts in the following case, and how much extra memory would it require?

a kernel calculates data of type double for a subsequent kernel to use
each thread calculates 4 doubles over the course of the kernel, and these must be packaged into groups - packets - of 4 doubles for the next kernel; hence, each thread must write its doubles from local memory to shared memory (to later write them to global memory) with a stride of 4 doubles between threads

according to my calculations, when single doubles are written with a stride of 4, every 4th thread within a half warp would hit the same bank and cause a bank conflict, and incrementing the address by 1 (double) every 4 threads would remove the bank conflict

x = (4 * threadIdx.x) + (threadIdx.x / 4) + y; y = [0, …, 3] (integer division)
shared[x] = local_double[y];

not so?

assume 32-bit mode, not 64-bit mode
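
one way to sanity-check the indexing above without a gpu is a small host-side simulation (plain c++). this is only a sketch under assumptions that are mine, not confirmed for any specific device: 32 banks of 4 bytes each, a double occupying two consecutive banks, and conflicts being evaluated over a group of threads whose size is passed in (16 for a half warp). the names plain_index, padded_index, conflict_degree and padded_size are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Assumed model: 32 banks, each 4 bytes wide; a double spans two banks.
constexpr int kBanks = 32;

// Shared-memory index (in doubles) for thread t writing its y-th double.
int plain_index(int t, int y)  { return 4 * t + y; }          // stride of 4, no padding
int padded_index(int t, int y) { return 4 * t + t / 4 + y; }  // +1 double every 4 threads

// Worst-case number of threads in a group that hit the same bank when every
// thread writes its y-th double. Only the low 32-bit word is counted; the
// high word lands in the next bank and shows the same pattern.
int conflict_degree(int (*index)(int, int), int y, int group) {
    assert(y >= 0 && y < 4 && group > 0);
    std::array<int, kBanks> hits{};        // accesses per bank, zero-initialized
    for (int t = 0; t < group; ++t) {
        int word = 2 * index(t, y);        // 32-bit word offset of the double
        ++hits[word % kBanks];
    }
    return *std::max_element(hits.begin(), hits.end());
}

// Number of doubles the padded array needs for the given thread count:
// the largest index is 4*(threads-1) + (threads-1)/4 + 3.
int padded_size(int threads) {
    assert(threads >= 1);
    return 4 * threads + (threads - 1) / 4;
}
```

under these assumptions, conflict_degree(plain_index, 0, 16) returns 4 and conflict_degree(padded_index, 0, 16) returns 1, so the padding does remove the conflict within a half warp. the memory increment is about one double per 4 threads: padded_size(16) is 67 doubles instead of 64. note that the same model reports a degree of 2 for the padded index over a full group of 32 threads, so whether the scheme is fully conflict-free depends on how the hardware actually groups 64-bit accesses.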