Warp serialization

Hello,

I wrote a kernel and, after running it through the CUDA Profiler, noticed that the warp_serialize counter was very high; in fact, it was higher than the branch count.

Since this kernel has very few branches (exactly one, and it is not supposed to diverge between threads), is there any other PTX instruction that causes serialization? The kernel does have quite a few selection instructions (i.e. selp and the like), which could be the culprit.

Thoughts?

warp_serialize has nothing to do with branches. If you look in doc/CUDA_Profiler_2.1.txt, it tells you that warp_serialize counts warps that serialize either on a constant memory access (different threads in a warp reading different elements of the array) or on a shared memory bank conflict.
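
To make the two cases concrete, here is a minimal sketch of access patterns that would be expected to raise warp_serialize, next to versions that should not. All names in it (coeffs, const_uniform, const_divergent, shared_strided, N) are made up for illustration and are not from the original kernel. It also assumes hardware with 16 shared memory banks, as on the GPUs that CUDA 2.1 targets; on other hardware the conflicting stride differs.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

#define N 256  // illustrative size, also used as the block size

// Hypothetical constant-memory table, not taken from the original kernel.
__constant__ float coeffs[N];

// All threads read the same constant address: the value is broadcast,
// so this access should not serialize.
__global__ void const_uniform(float *out, int k)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = coeffs[k];
}

// Each thread reads a different constant address: the reads within a
// warp are serviced one address at a time, i.e. serialized.
__global__ void const_divergent(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = coeffs[tid % N];
}

// stride == 1: consecutive threads hit consecutive banks, no conflict.
// stride == 16 (assuming 16 banks): neighbouring threads map to the
// same bank, which gives a bank conflict.
__global__ void shared_strided(float *out, int stride)
{
    __shared__ float tile[N];
    int t = threadIdx.x;
    tile[t] = (float)t;
    __syncthreads();
    out[blockIdx.x * blockDim.x + t] = tile[(t * stride) % N];
}

int main()
{
    float h_in[N], h_out[N];
    for (int i = 0; i < N; ++i) h_in[i] = (float)i;
    cudaMemcpyToSymbol(coeffs, h_in, sizeof(h_in));

    float *d_out = 0;
    cudaMalloc(&d_out, N * sizeof(float));

    // Launch each variant once so the profiler can compare counters.
    const_uniform<<<1, N>>>(d_out, 3);
    const_divergent<<<1, N>>>(d_out);
    shared_strided<<<1, N>>>(d_out, 1);
    shared_strided<<<1, N>>>(d_out, 16);

    // Sanity check on the last launch: out[t] == (t * 16) % N.
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    for (int t = 0; t < N; ++t) assert(h_out[t] == (float)((t * 16) % N));

    cudaFree(d_out);
    printf("done\n");
    return 0;
}
```

Profiling each launch separately should show which pattern drives the counter. If your kernel indexes a __constant__ array or a __shared__ array with a per-thread index, those accesses are the first place to look, rather than the selp instructions.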