Hello,
I wrote a kernel and, after running it through the CUDA Profiler, noticed that the warp serialize field was very high, in fact higher than the branch count.
Since this kernel has very few branches (exactly one, and it should not diverge between threads), are there any other PTX instructions that cause serialization? The kernel does have quite a few selection instructions (i.e. selp and the like); could those be the culprit?
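To illustrate the kind of selection pattern I mean, here is a hypothetical sketch. It is not my actual kernel, and the names `pick` and `select_kernel` are made up for this example. I am assuming the ternary gets lowered to a select rather than a branch, which is what I usually see but is up to the compiler.

```cpp
// Hypothetical illustration only; not the actual kernel from this post.
// Builds with nvcc as CUDA, or with a plain C++ compiler (host helper only).
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// A per-thread conditional with cheap, side-effect-free operands.
// Assumption: the compiler lowers this to a compare plus select
// (setp + selp in PTX) instead of a branch. Check the PTX to be sure.
__host__ __device__ inline float pick(float flag, float a, float b)
{
    return (flag > 0.0f) ? a : b;
}

#ifdef __CUDACC__
// Each thread selects between two inputs based on its own flag value.
// The only explicit branch is the bounds guard below.
__global__ void select_kernel(const float *flag, const float *a,
                              const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = pick(flag[i], a[i], b[i]);
    }
}
#endif
```

Compiling something like this with `nvcc -ptx` and looking for `selp` in the output should show whether a select or a branch was actually generated.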
Thoughts?