I am trying to work out where what I believe to be excessive warp serialisation in my kernel is coming from, and in particular how to account for the warp_serialize number that appears in the profiling output. According to the profiler documentation:
warp_serialize : Number of thread warps that serialize on address conflicts
to either shared or constant memory
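For reference, this is how I understand an "address conflict" on shared memory. Below is a rough host-side model of it, not code from my kernel. The bank count (16), the 32-bit bank width and the half-warp granularity are my assumptions about this hardware generation and are not stated in the profiler documentation. The function name conflictDegree is made up for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <set>

// Assumed hardware parameters (my assumption, not from the profiler docs).
constexpr int kBanks = 16;     // number of shared memory banks
constexpr int kHalfWarp = 16;  // threads whose accesses are serviced together

// Hypothetical helper: thread t of a half-warp reads the 32-bit word at
// index t * strideWords. Returns the largest number of distinct words that
// map to the same bank, i.e. how many passes the access would need under
// this model. Threads reading the same word are counted once (broadcast).
int conflictDegree(int strideWords) {
    std::array<std::set<int>, kBanks> wordsPerBank;
    for (int t = 0; t < kHalfWarp; ++t) {
        int word = t * strideWords;
        wordsPerBank[word % kBanks].insert(word);
    }
    std::size_t worst = 0;
    for (const auto& words : wordsPerBank) {
        worst = std::max(worst, words.size());
    }
    return static_cast<int>(worst);
}
```

Under this model a stride of 1 word gives a degree of 1 (no conflict), a stride of 2 gives 2, and a stride of 16 gives 16, which is the kind of access I would expect the counter to be reporting.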
I have removed all use of shared variables from my kernel and am using global memory for the kernel arguments. My ptxinfo output is:
ptxas info : Compiling entry function ‘_Z10saw_trimerv’
ptxas info : Used 30 registers, 320+0 bytes lmem, 28 bytes cmem, 4 bytes cmem
So there is no shared memory and only a small amount of constant memory (28 bytes for constant program ‘variables’ and 4 bytes for compiler-generated constants).
Profiling output is:
method=[ _Z10saw_trimerv ] gputime=[ 43286936.000 ] cputime=[ 43286976.000 ] occupancy=[ 0.500 ] cta_launched=[ 768 ] warp_serialize=[ 762985779 ] divergent_branch=[ 226520297 ] branch=[ 2164832414 ]
So where is the warp_serialize count coming from? The number above is not much different when I revert to using shared memory for some of my data.
I have checked the PTX file and see no reference to shared or constant memory accesses, but I could have missed something. The PTX file is attached.
sawtrimer.txt (28.4 KB)