cudaprof "Warp Serialize" counter: strange factor of seven and imprecise documentation?

philipjfry · January 21, 2010, 1:00pm

I observe strange “Warp Serialize” numbers in the CUDA profiler when bank conflicts occur in the “upper” (second) halfwarp of a warp (CUDA 2.3 on Linux with a GTX295, and CUDA 2.3a on MacOS with an 8600M and an 8800GT).

I wrote a small kernel to create different scenarios of shared memory bank conflicts, and measured a single block with a single warp. (Of course, I had to execute it multiple times until it was scheduled on multiprocessor 0, because Warp Serialization is an event measured only there.)

If I provoke bank conflicts only in the first halfwarp, I get exactly the expected counter value: if the worst bank conflict is n-way, the counter is incremented by (n-1), because (n-1) additional shared memory accesses are required.
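To make "n-way" concrete, here is a minimal host-side sketch in C++ (not the kernel I actually measured). It assumes 16 banks of 32-bit words, as on compute capability 1.x hardware, and it simplifies by counting distinct words per bank, so threads reading the same word are not counted as a conflict. The names `conflictDegree`, `strided` and `expectedCounterFirstHalfwarp` are made up for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <set>

// Assumptions: 16 shared memory banks of 32-bit words, 16 threads per halfwarp.
constexpr int kBanks = 16;
constexpr int kHalfwarp = 16;

// Degree n of the worst bank conflict in one halfwarp, given the 32-bit word
// index each thread accesses. Distinct words in the same bank conflict;
// identical words are treated as conflict-free (simplified broadcast model).
int conflictDegree(const std::array<int, kHalfwarp>& wordIndex) {
    std::array<std::set<int>, kBanks> wordsPerBank;
    for (int w : wordIndex) {
        wordsPerBank[w % kBanks].insert(w);
    }
    std::size_t worst = 1;
    for (const auto& words : wordsPerBank) {
        worst = std::max(worst, words.size());
    }
    return static_cast<int>(worst);
}

// Access pattern where thread t touches word t * stride (hypothetical helper).
std::array<int, kHalfwarp> strided(int stride) {
    std::array<int, kHalfwarp> idx{};
    for (int t = 0; t < kHalfwarp; ++t) {
        idx[t] = t * stride;
    }
    return idx;
}

// Counter increment I see when only the first halfwarp has an n-way conflict.
int expectedCounterFirstHalfwarp(int n) {
    return n - 1;
}
```

With this model, a stride of 1 gives no conflict (n = 1), a stride of 2 gives a 2-way conflict, and a stride of 16 puts all 16 threads into bank 0 (n = 16).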

However, if the same bank conflicts occur in the second halfwarp instead, the counter shows a number 7 times as high. Moreover, the counter values for the first and second halfwarp are added up. So if the shared memory accesses of a warp have an n1-way bank conflict in the first halfwarp and an n2-way conflict in the second, the counter is increased by (n1-1)+7(n2-1).

The measured performance behaves differently again and corresponds to what the manuals say: bank conflicts lead to a serialization of the whole warp. So shared memory accesses with an n1-way conflict in the first halfwarp and an n2-way conflict in the second will take max(n1,n2) times longer than accesses with no conflict at all.
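The mismatch between the counter and the run time can be sketched in a few lines of C++. Both functions only encode what I observed in my measurements, not documented profiler behavior, and the names `observedCounter` and `slowdownFactor` are made up for illustration.

```cpp
#include <algorithm>

// Counter increment per warp access as observed in my measurements:
// conflicts in the second halfwarp are weighted by a factor of 7.
constexpr int observedCounter(int n1, int n2) {
    return (n1 - 1) + 7 * (n2 - 1);
}

// Slowdown relative to a conflict-free access, as measured and as the
// manuals describe it: the worse of the two halfwarps determines the time.
constexpr int slowdownFactor(int n1, int n2) {
    return std::max(n1, n2);
}
```

For example, (n1, n2) = (1, 2) and (n1, n2) = (8, 1) both produce a counter value of 7, although the first case is 2 times slower than a conflict-free access and the second is 8 times slower.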

I consider the factor of 7 for conflicts in the second halfwarp a serious bug: it makes the counter usable only if you can rely on the conflicts being random enough.

Additionally, this counter does not really relate to serialization, but to the conflicts in each halfwarp. In my opinion, this fact should be reflected in the documentation.

Comments welcome!

cudaprof.html states:
