I observe strange “Warp Serialize” numbers in the CUDA profiler when bank conflicts occur in the “upper” halfwarp (CUDA 2.3 on Linux with a GTX295, and CUDA 2.3a on MacOS with an 8600M and an 8800GT).
I wrote a small kernel to create different scenarios of shared memory bank conflicts, and measured a single block with a single warp. (Of course, I had to execute it multiple times until it was scheduled on multiprocessor 0, because Warp Serialization is an event measured only there.)
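A minimal sketch of such a kernel (not my exact code; `stride` is an illustrative parameter, and this assumes compute capability 1.x with 16 shared memory banks and a 32-thread block, so one warp):

```cuda
// Sketch: provoke an n-way bank conflict in the second halfwarp only.
// On CC 1.x there are 16 banks of 4-byte words; conflicts are resolved
// per halfwarp. A power-of-two stride s gives an s-way conflict.
__global__ void conflictKernel(int *out, int stride)
{
    __shared__ int smem[16 * 32];
    int tid = threadIdx.x;
    int idx;
    if (tid < 16)
        idx = tid;                  // first halfwarp: one bank per thread, conflict-free
    else
        idx = (tid - 16) * stride;  // second halfwarp: stride 2 -> 2-way, 4 -> 4-way, ...
    smem[idx] = tid;
    __syncthreads();
    out[tid] = smem[idx];
}
```

Launching this with `<<<1, 32>>>` and varying `stride` lets you dial in the conflict degree for one halfwarp while keeping the other conflict-free; swapping the two branches moves the conflict into the first halfwarp instead.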
If I provoke bank conflicts only in the first halfwarp, I get exactly the expected counter value. So if the highest bank conflict is n-way, the counter is incremented by (n-1), since (n-1) additional shared memory accesses are required.
However, if the same bank conflict(s) occur in the second halfwarp instead, the counter shows a value 7 times as high. Moreover, the counter values for the first and second halfwarp are added up. So if the shared memory accesses of a warp have an n1-way bank conflict in the first halfwarp and an n2-way conflict in the second, the counter is increased by (n1-1)+7(n2-1).
Looking at the performance, it behaves differently again and corresponds to what the manuals say: bank conflicts lead to a serialization of the whole warp. So shared memory accesses with an n1-way conflict in the first halfwarp and an n2-way conflict in the second will take max(n1,n2) times longer than accesses with no conflict at all.
I consider the factor of 7 for conflicts in the second halfwarp to be a serious bug: it makes the counter usable only if you can rely on the conflicts being distributed evenly enough across both halfwarps.
Additionally, this counter does not really measure serialization, but rather the conflicts within each halfwarp. In my opinion, the documentation should reflect this.