Shared Memory Bank Conflicts: Performance effects of bank conflicts


I am currently testing some applications that intentionally produce heavy bank conflicts in shared memory.
I have profiled these applications with the “Compute Visual Profiler” and it produced the following metrics concerning the bank conflicts:
- l1 shared bank conflict: 992992
- Shared memory bank conflict per shared memory instruction (%): 3100
- Shared bank conflict replay (%): 59.0139

I have also compared this application (the conflict one) with another one without any conflicts:
- l1 shared bank conflict: 0
- Shared memory bank conflict per shared memory instruction (%): 0
- Shared bank conflict replay (%): 0.0004

The applications execute the same number of instructions, but the number of instructions issued differs (as the conflicting shared memory accesses get serialized):

  • 1682641 instructions issued (conflicts)
  • 689649 instructions issued (no conflicts)

I have checked that the two applications differ only in their shared memory access pattern, as I want to measure how much the bank conflicts affect the runtime.
So here are my questions:
- I assume that with a 2-way bank conflict the shared memory access gets serialized, so the access takes twice as long as one without a conflict.
For my application this would mean that the runtime should be about 30 times longer than without conflicts (as I produced the maximum number of conflicts).
But my tests showed that the application producing conflicts is only 4 times slower (i.e. there is roughly a factor of 8 that I cannot explain). To what extent do shared memory bank conflicts really influence the runtime of an application?
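
To make the serialization assumption concrete, here is a minimal Python sketch of how the conflict degree follows from the access stride. It assumes 32 banks that are each one 32-bit word wide and a warp of 32 threads, which would match Fermi-class hardware but is not stated anywhere in this post; the function name `conflict_degree` is made up for illustration. It only models the number of serialized passes per shared memory instruction, not the runtime.

```python
from collections import Counter

NUM_BANKS = 32   # assumption: 32 banks, each one 32-bit word wide
WARP_SIZE = 32   # assumption: 32 threads per warp


def conflict_degree(stride_words, num_banks=NUM_BANKS, warp_size=WARP_SIZE):
    """Return n for the n-way access when thread t reads word t * stride_words.

    Assumes stride_words >= 1, so all threads touch distinct words
    (threads reading the same word are broadcast and do not conflict).
    """
    # Word index i is assumed to live in bank i % num_banks.
    banks = Counter((t * stride_words) % num_banks for t in range(warp_size))
    # The busiest bank determines how many passes the access needs.
    return max(banks.values())


for stride in (1, 2, 32):
    degree = conflict_degree(stride)
    print(f"stride {stride:2d}: {degree}-way access, {degree - 1} replay(s)")
```

Under these assumptions a stride of 32 words needs 32 passes, i.e. 31 replays per shared memory instruction, which would be consistent with the reported 3100 % per shared memory instruction.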

- The metric “shared bank conflict replay” (~60%) gives the percentage of replays due to shared bank conflicts (the overall replay rate is also ~60%). The metric is calculated as:
100 * (l1 shared bank conflict) / (instructions issued) -> this gives the ~60%
But the meaning of this 60% is not clear to me. How do the replays affect the runtime of my application?
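
As a sanity check, the formula can be evaluated with the counters posted above. The following Python sketch only redoes that arithmetic; the variable names are made up, and reading each conflict as exactly one additional issued instruction is an assumption suggested by the numbers, not a statement taken from the profiler documentation.

```python
# Counters copied from the profiler output quoted in this post.
conflicts = 992_992            # l1 shared bank conflict
issued_conflict = 1_682_641    # instructions issued, conflicting version
issued_clean = 689_649         # instructions issued, conflict-free version

# Formula from the post: 100 * conflicts / instructions issued.
replay_pct = 100 * conflicts / issued_conflict

# How many more instructions the conflicting version had to issue.
extra_issued = issued_conflict - issued_clean

# Growth in issued instructions between the two versions.
issue_ratio = issued_conflict / issued_clean

print(f"shared bank conflict replay: {replay_pct:.4f} %")
print(f"extra instructions issued:   {extra_issued}")
print(f"issued ratio:                {issue_ratio:.2f}x")
```

The difference in issued instructions equals the conflict count, and the number of issued instructions grows by a factor of about 2.44 rather than 32. This suggests, without proving it, that the serialization applies only to the shared memory instructions and not to every instruction in the kernel.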

I would be grateful if anyone has answers to my questions.
Regards, Georg.

- CUDA 4.0
- Compute Visual Profiler 4.0.17