Warp Serialize

I have an algorithm in which I compute a more or less complicated sum over a large dataset. I want to fine-grain my parallelism so that each thread computes one term of this sum, and then all the terms are added up. Since I have to execute slightly different code for each term and each thread (some matrix-vector multiplications, sometimes with the inverse matrix), I need flow control.

So I thought I would just split my sum across warps, with each warp calculating its term on 32 data items. That way I should not have any divergent branches, since all threads within a warp execute the same instructions.
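A minimal sketch of what I mean, assuming hypothetical helpers `termKindOf`, `computeMatVec`, and `computeInvMatVec` (none of these names are real; they stand in for the per-term math):

```cuda
// One warp per sum term: the branch condition depends only on warpId,
// so every lane in a warp takes the same path and nothing diverges.
__global__ void sumTerms(const float *data, float *partials, int nTerms)
{
    int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane   = threadIdx.x % 32;
    if (warpId >= nTerms) return;

    float t;
    // Uniform per-warp branch: no intra-warp divergence.
    if (termKindOf(warpId) == MATVEC)        // placeholder predicate
        t = computeMatVec(data, warpId, lane);
    else
        t = computeInvMatVec(data, warpId, lane);

    // Each lane writes its partial; the terms are reduced afterwards.
    partials[warpId * 32 + lane] = t;
}
```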

But… I cannot launch enough threads to hide register read-after-write hazards, and I need to sync after each warp. So each warp is executed essentially serially, which I believe is reflected in the 'warp serialize' counter in the profiler.

My question, if someone can follow my approach: what is the impact of this warp serialization? I ran some small experiments, packing two or more terms into a single warp, and the execution time grew only slightly, by 5% or so. But I am unsure about the big picture. Clarification about this 'warp serialize' counter in the profiler would also be appreciated.

From the CUDA profiler documentation:

	warp_serialize
	--------------
	This option records the number of thread warps that serialize on address
	conflicts to either shared or constant memory.

Your warp serialize issues are coming from somewhere else.

As for the penalty of your sync operations, one simple way to test the overhead is to just remove the syncs :) You'll get incorrect results, of course, but as long as your loop lengths don't depend on values read after a sync, all the threads will still do the same amount of work, just without the sync overhead, so you can benchmark the difference.
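One way to sketch this, guarding each sync behind a macro so the kernel can be rebuilt with `-DBENCH_NO_SYNC` purely for timing (the macro name and kernel are my own illustration, not from your code):

```cuda
// Toggle all barriers at compile time to measure their cost.
// Results are wrong without the sync; only the timing is meaningful.
#ifdef BENCH_NO_SYNC
#define SYNC() ((void)0)
#else
#define SYNC() __syncthreads()
#endif

__global__ void accumulate(float *buf)
{
    int i = threadIdx.x;
    buf[i] += 1.0f;
    SYNC();  // compiled out with -DBENCH_NO_SYNC
    // This read races without the barrier, but the work done per
    // thread is identical, so the two builds are comparable.
    buf[i] += buf[(i + 1) % blockDim.x];
}
```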

Many applications in CUDA are bound by memory bandwidth, not computation, so it is likely that you will not notice a significant difference.