Does shared memory bank conflict really affect performance?

Hello,

I looked at the matrix transpose kernel in the CUDA SDK.

I read the comment explaining how adding +1 to the shared array's row size can remove shared memory bank conflicts.

So I tested the runtime with and without the padding. In other words, I ran the program with the shared memory array declared in two ways (a kernel sketch using these declarations follows the list):

  1. __shared__ float block[BLOCK_DIM][BLOCK_DIM+1];

  2. __shared__ float block[BLOCK_DIM][BLOCK_DIM];
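
For context, here is a minimal sketch of a tiled transpose kernel in the spirit of the SDK sample (variable names and BLOCK_DIM = 16 are illustrative, not the exact SDK code):

    #define BLOCK_DIM 16

    __global__ void transpose(float *odata, const float *idata,
                               int width, int height)
    {
        // The +1 pads each row so that a column walk touches 16 different banks.
        __shared__ float block[BLOCK_DIM][BLOCK_DIM + 1];

        // Read a tile with coalesced row-major accesses.
        unsigned int x = blockIdx.x * BLOCK_DIM + threadIdx.x;
        unsigned int y = blockIdx.y * BLOCK_DIM + threadIdx.y;
        if (x < width && y < height)
            block[threadIdx.y][threadIdx.x] = idata[y * width + x];

        __syncthreads();

        // Write the tile transposed; the shared memory read walks a column.
        x = blockIdx.y * BLOCK_DIM + threadIdx.x;
        y = blockIdx.x * BLOCK_DIM + threadIdx.y;
        if (x < height && y < width)
            odata[y * height + x] = block[threadIdx.x][threadIdx.y];
    }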

Obviously, 2) has a 16-way bank conflict, and I could confirm that with GPGPU-Sim.
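
For anyone following along, the conflict arithmetic can be spelled out (a sketch assuming the 16 banks of 4-byte words on compute capability 1.x hardware like the C1060; bankOf is just an illustrative helper):

    // Bank hit by element block[r][c] of a row-major float array:
    int bankOf(int r, int c, int rowSize) { return (r * rowSize + c) % 16; }

    // Unpadded (rowSize = 16): bankOf(r, c, 16) == c for every r, so the
    // 16 threads of a half-warp reading one column all hit the same bank
    // (a 16-way conflict, serialized into 16 transactions).
    // Padded (rowSize = 17): bankOf(r, c, 17) == (r + c) % 16, so a column
    // read touches all 16 banks and is conflict-free.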

However, the problem is that the runtimes show almost no difference.

I increased the input size until the runtime was about 300 ms. The runtime difference between the two versions is only 0.x ms or 0.0x ms… negligible. I even changed the kernel so that the latency of reading shared memory would dominate more of the total runtime, but I could still barely see a difference.
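
In case it helps to reproduce this, a microbenchmark along these lines (a sketch, not my exact modified kernel; smemReadBench and iters are made-up names) keeps each half-warp walking shared memory columns so that bank behavior dominates:

    __global__ void smemReadBench(float *out, int iters)
    {
        __shared__ float block[16][16];        // change to [16][17] for the padded case
        int tx = threadIdx.x, ty = threadIdx.y;
        block[ty][tx] = (float)(tx + ty);
        __syncthreads();

        float acc = 0.0f;
        for (int i = 0; i < iters; ++i)
            acc += block[tx][(ty + i) & 15];   // column walk: 16-way conflict when unpadded
        // Write the result out so the compiler cannot remove the loop.
        out[blockIdx.x * 256 + ty * 16 + tx] = acc;
    }

Launched as smemReadBench<<<grid, dim3(16, 16)>>>(d_out, 10000), the two declarations should show a measurable gap, since almost all the time is spent in the conflicting reads.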

My machine is a Tesla C1060…

Do shared memory bank conflicts really make much difference to the runtime?

Matrix transposition is going to be limited by global memory bandwidth, so I am not surprised that shared memory conflicts are a negligible effect here.
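
As a quick sanity check, you can compute the effective bandwidth of your run (a back-of-envelope sketch; effectiveGBs is an illustrative helper, and ~102 GB/s is the C1060's theoretical peak memory bandwidth):

    // Effective bandwidth of a width x height float transpose:
    // each element is read once and written once.
    double effectiveGBs(int width, int height, double ms)
    {
        double bytes = 2.0 * (double)width * height * sizeof(float);
        return bytes / (ms * 1.0e-3) / 1.0e9;   // GB/s
    }
    // If this is already a large fraction of the C1060's ~102 GB/s peak,
    // shaving shared memory latency cannot move the total runtime much.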

The key to applying any of these performance heuristics is recognizing when the effect they address is the bottleneck in your code. (That takes a mixture of benchmarking code variations and some experience to do well.)
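
For the benchmarking half, timing just the kernel with CUDA events avoids measuring transfers and launch noise (a minimal sketch; grid, threads, and the device pointers are assumed to be set up already):

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    transpose<<<grid, threads>>>(d_odata, d_idata, width, height);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // kernel time only, in milliseconds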

Read the transpose whitepaper to the end. At the point where you switch from the bank-conflict version to the no-bank-conflict version, you are still limited by a "partition camping" effect. If you redo the same test with the final code (the version without partition camping), you will see the difference between bank conflict and no bank conflict.
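
If I remember the whitepaper correctly, the partition-camping fix is a diagonal reordering of block indices, roughly like this for a square grid (a sketch; the kernel then uses blockIdx_x/blockIdx_y everywhere in place of blockIdx.x/blockIdx.y):

    // Diagonal block reordering (square-matrix case): consecutive blocks are
    // spread across memory partitions instead of camping on the same one.
    int blockIdx_y = blockIdx.x;
    int blockIdx_x = (blockIdx.x + blockIdx.y) % gridDim.x;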