I looked at the matrix transpose kernel in the CUDA SDK.
I read the comment saying that adding +1 to the shared array size removes shared memory bank conflicts.
So I measured the runtime with and without the padding.
In other words, I ran the program with each of the following shared memory declarations:
1) __shared__ float block[BLOCK_DIM][BLOCK_DIM+1];
2) __shared__ float block[BLOCK_DIM][BLOCK_DIM];
Obviously, 2) has 16-way bank conflicts, and I confirmed that with GPGPU-Sim.
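To make the 16-way claim concrete, here is a small host-side sketch (plain C++, no GPU needed) that counts how many threads of a half-warp land in the same bank when they read one column of the tile. It assumes BLOCK_DIM is 16, 16 banks of 32-bit words, and conflicts resolved per half-warp, which is my understanding of compute capability 1.x parts such as the C1060. The names kBanks, kHalfWarp and max_conflict_degree are made up for this illustration and are not from the SDK.

```cpp
#include <algorithm>
#include <array>

// Assumed hardware model (compute capability 1.x):
// 16 shared memory banks, one 32-bit word per bank, conflicts per half-warp.
constexpr int kBanks = 16;
constexpr int kHalfWarp = 16;

// Thread t of a half-warp reads block[t][col], which is word index
// t * row_stride + col in the flattened array. row_stride is the inner
// array dimension: BLOCK_DIM for the unpadded tile, BLOCK_DIM + 1 for the
// padded one. Returns the largest number of threads hitting one bank.
int max_conflict_degree(int row_stride, int col = 0) {
    std::array<int, kBanks> hits{};  // zero-initialized per-bank counters
    for (int t = 0; t < kHalfWarp; ++t) {
        int word = t * row_stride + col;
        ++hits[word % kBanks];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

Under those assumptions, max_conflict_degree(16) gives 16 (every thread maps to the same bank) and max_conflict_degree(17) gives 1 (every thread maps to a different bank), which matches what the simulator reports for the two declarations.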
However, the problem is that the runtime shows almost no difference.
I increased the input size until the runtime was about 300 ms. The difference between the two versions is only 0.x ms or 0.0x ms, which is negligible. I even changed the kernel so that the latency of reading shared memory would dominate the total runtime more, but I could still barely see a difference.
My GPU is a Tesla C1060.
Do shared memory bank conflicts really make much of a difference in runtime?