Warp Serialize

I have an algorithm in which I compute a more or less complicated sum over a large dataset. I want to fine-grain my parallelism so that each thread computes one term of this sum, and then all the terms are added up. Since I have to execute slightly different code for each term and each thread (some matrix-vector multiplications, sometimes with the inverse matrix), I need flow control.

So I thought I would just split my sum across warps, with each warp calculating its term on 32 data items. That way I should not have any divergent branches, since all threads within a warp execute the same instructions.
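A minimal sketch of what I mean, assuming hypothetical helpers `termKindOf`, `computeMatVec`, and `computeInvMatVec` (none of these names are real; they stand in for the per-term math):

```cuda
// One warp per sum term: the branch condition depends only on warpId,
// so every lane in a warp takes the same path and nothing diverges.
__global__ void sumTerms(const float *data, float *partials, int nTerms)
{
    int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane   = threadIdx.x % 32;
    if (warpId >= nTerms) return;

    float t;
    // Uniform per-warp branch: no intra-warp divergence.
    if (termKindOf(warpId) == MATVEC)        // placeholder predicate
        t = computeMatVec(data, warpId, lane);
    else
        t = computeInvMatVec(data, warpId, lane);

    // Each lane writes its partial; the terms are reduced afterwards.
    partials[warpId * 32 + lane] = t;
}
```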

But… I cannot launch enough threads to hide register read-after-write hazards, and I need to sync after each warp. So each warp is executed essentially serially, which I believe is reflected in the 'warp serialize' counter in the profiler.

My question, if someone can follow my approach: what is the impact of this warp serialization? I ran some small experiments, packing two or more terms into a single warp, and the execution time grew only slightly, by 5% or so. But I am unsure about the big picture. Clarification about this 'warp serialize' counter in the profiler would also be appreciated.

From the CUDA profiler documentation:

	warp_serialize
	--------------
	This option records the number of thread warps that serialize on address
	conflicts to either shared or constant memory.

Your warp serialize issues are coming from somewhere else.

As for the penalty of your sync operations, one simple way to test the overhead is to just remove the syncs :) You'll get incorrect results, of course, but as long as your loop lengths don't depend on values read after a sync, all the threads will still do the same amount of work, just without the sync overhead, so you can benchmark the difference.
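One way to sketch this, guarding each sync behind a macro so the kernel can be rebuilt with `-DBENCH_NO_SYNC` purely for timing (the macro name and kernel are my own illustration, not from your code):

```cuda
// Toggle all barriers at compile time to measure their cost.
// Results are wrong without the sync; only the timing is meaningful.
#ifdef BENCH_NO_SYNC
#define SYNC() ((void)0)
#else
#define SYNC() __syncthreads()
#endif

__global__ void accumulate(float *buf)
{
    int i = threadIdx.x;
    buf[i] += 1.0f;
    SYNC();  // compiled out with -DBENCH_NO_SYNC
    // This read races without the barrier, but the work done per
    // thread is identical, so the two builds are comparable.
    buf[i] += buf[(i + 1) % blockDim.x];
}
```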

Many applications in CUDA are bound by memory bandwidth, not computation, so it is likely that you will not notice a significant difference.