Accounting for warp serialisation


I am trying to work out where the warp serialisation in my kernel, which I believe to be excessive, is coming from, and in particular how to account for the number that appears in the profiling output. According to the profiler documentation:

warp_serialize : Number of thread warps that serialize on address conflicts
to either shared or constant memory

I have removed the use of shared variables from my kernel and am using global memory for the kernel's arguments. My ptxinfo output is:

ptxas info : Compiling entry function '_Z10saw_trimerv'
ptxas info : Used 30 registers, 320+0 bytes lmem, 28 bytes cmem[1], 4 bytes cmem[14]

No shared memory, and only a small amount of constant memory (28 bytes in constant program ‘variables’ and 4 bytes in compiler-generated constants).

Profiling output is:

method=[ _Z10saw_trimerv ] gputime=[ 43286936.000 ] cputime=[ 43286976.000 ] occupancy=[ 0.500 ] cta_launched=[ 768 ] warp_serialize=[ 762985779 ] divergent_branch=[ 226520297 ] branch=[ 2164832414 ]

So where is the warp_serialize count coming from? The number above is not much different when I revert to using shared memory for some of my data.

I have checked the PTX file and see no reference to shared or constant memory access but I could have missed something. Attached is the PTX file.


sawtrimer.txt (28.4 KB)

Unless all threads in a warp read the same word of “constant” memory, you get warp_serialize and your kernel slows down.
Perhaps this is what you are seeing?
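
To make the difference concrete, here is a minimal sketch of the two access patterns. It is not taken from the attached kernel: the `table` array and both kernel names are made up for illustration, and it assumes a reasonably current CUDA toolkit. In the first kernel every thread of a warp reads the same constant word; in the second each thread reads a different one, which is the pattern that would be expected to serialise.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical lookup table in constant memory (not from the original kernel).
__constant__ float table[32];

// All threads of a block (and so of each warp) read the same word.
__global__ void uniform_read(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = table[blockIdx.x % 32];
}

// Each thread of a warp reads a different word of constant memory.
__global__ void divergent_read(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = table[threadIdx.x % 32];
}

int main()
{
    // Fill the constant table from the host.
    float host[32];
    for (int i = 0; i < 32; ++i) host[i] = (float)i;
    cudaMemcpyToSymbol(table, host, sizeof(host));

    const int blocks = 64, threads = 128;
    float *out;
    cudaMalloc(&out, blocks * threads * sizeof(float));

    // Launch both patterns so they can be profiled side by side.
    uniform_read<<<blocks, threads>>>(out);
    divergent_read<<<blocks, threads>>>(out);
    cudaDeviceSynchronize();

    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(out);
    return 0;
}
```

Profiling the two kernels separately should show whether the per-thread indexing is what drives the counter up, assuming your profiler reports warp_serialize per kernel.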



Do you use atomic operations? IIRC they also get serialized, at least on conflicts.
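
A rough sketch of what I mean by conflicts, assuming a current toolkit and a device that supports global atomics; the kernel names and the counter layout are invented for the example and are not from your code. The first kernel has every thread hit one address, the second gives each thread its own slot.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Every thread updates the same counter, so all atomics in a warp conflict.
__global__ void contended(unsigned int *counter)
{
    atomicAdd(counter, 1u);
}

// Each thread updates its own slot, so no two threads share an address.
__global__ void spread(unsigned int *slots)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    atomicAdd(&slots[tid], 1u);
}

int main()
{
    const int blocks = 64, threads = 128;
    const int n = blocks * threads;

    unsigned int *counter, *slots;
    cudaMalloc(&counter, sizeof(unsigned int));
    cudaMalloc(&slots, n * sizeof(unsigned int));
    cudaMemset(counter, 0, sizeof(unsigned int));
    cudaMemset(slots, 0, n * sizeof(unsigned int));

    contended<<<blocks, threads>>>(counter);
    spread<<<blocks, threads>>>(slots);
    cudaDeviceSynchronize();

    // The contended counter should equal the total number of threads.
    unsigned int total = 0;
    cudaMemcpy(&total, counter, sizeof(total), cudaMemcpyDeviceToHost);
    printf("counter = %u (expected %d)\n", total, n);

    cudaFree(counter);
    cudaFree(slots);
    return 0;
}
```

Both kernels give correct results; the point is only that the first one forces the updates to one address to happen one after another, so comparing the two in the profiler may tell you how much of your count comes from the atomic.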

Thanks guys for the responses. Yes, I am using an atomic operation. I will take a look at it.

Do you know how I can find out from the PTX where my constant memory accesses are? I have none explicitly in my code so I assume the compiler has put them in. I think the accesses should be from the same memory location.


Search for “.const”.

I’d also assume that constant memory references put in by the compiler always access the same location per warp.

I can’t find any in the PTX file.

I have removed my use of atomic operations and cleaned up my code. I still have ‘12 bytes cmem[1]’.

In “nvcc.doc V2.0”, page 28, it says:

"Used constant memory is partitioned in constant program ‘variables’ (bank 1), plus compiler generated constants (bank 14)."

What are ‘constant program variables’?