How efficient is my kernel?

How can I analyse the profiler output properly? I have a value of 1.9m warps serialized and 19m divergent branches, but how can I tell how good or bad this is? Can I compare it with another value?

For example, I have 1.5b instructions, of which 252m are branches, so in comparison 19m divergent branches aren’t that bad, right?
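One way to make the raw counters comparable is to turn them into ratios of the same run. The helper below is only a sketch of that arithmetic. The name `percent_of` is made up for illustration and is not part of the profiler, and it assumes the counters being compared were collected over the same kernel launch.

```cpp
#include <cstdint>

// Returns `part` as a percentage of `whole`.
// Returns 0 when `whole` is 0 so an empty counter does not divide by zero.
double percent_of(std::uint64_t part, std::uint64_t whole) {
    if (whole == 0) {
        return 0.0;
    }
    return 100.0 * static_cast<double>(part) / static_cast<double>(whole);
}
```

With the figures quoted above, `percent_of(19000000, 252000000)` comes to roughly 7.5% of branches diverging, and `percent_of(1900000, 1500000000)` comes to roughly 0.13% warp serializations per instruction. A ratio like this is easier to track from one revision to the next than the absolute counts.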

I generally use the profiler on different revisions of my code. I run it initially to see what I need to fix (uncoalesced reads/writes, warp serializes, divergent branches) and run it again after modifying my code to see what progress I have made.

Each algorithm is different, so it’s hard to benchmark your results against someone else’s code. Ideally, you would have 0 warps serialized (no shared memory bank conflicts) and 0 divergent branches, but that’s not always possible. Use the profiler to find out which code is causing these problems and optimize that.
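To get a feel for when a branch diverges, it can help to model it on the host. The sketch below is a simplified model, not profiler output. It assumes a branch diverges whenever threads within one warp disagree on the condition, and the function name and parameters are hypothetical.

```cpp
#include <functional>

// Counts the warps in which at least two threads evaluate pred(tid)
// differently, i.e. the warps that would diverge on `if (pred(tid))`.
// Assumes threads are grouped into warps by consecutive thread index.
int count_divergent_warps(int num_threads, int warp_size,
                          const std::function<bool(int)>& pred) {
    int divergent = 0;
    for (int base = 0; base < num_threads; base += warp_size) {
        const bool first = pred(base);
        for (int t = base + 1; t < base + warp_size && t < num_threads; ++t) {
            if (pred(t) != first) {
                ++divergent;  // one disagreement is enough for this warp
                break;
            }
        }
    }
    return divergent;
}
```

For 256 threads and a warp size of 32, a condition like `tid % 2 == 0` diverges in all 8 warps, while `(tid / 32) % 2 == 0` diverges in none because every thread in a warp takes the same path. Rearranging work so the condition is uniform per warp is the usual way to bring the divergent branch count down.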

Thanks. How would I analyse which code is causing the problems? The profiler only tells me total statistics for the entire kernel.


Well, you can start by looking through the code for known issues such as shared memory bank conflicts or non-sequential (uncoalesced) addressing of global memory.
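For strided shared memory indexing, you can check for bank conflicts on paper or with a few lines of host code. The sketch below assumes 32-bit words, that word `w` lives in bank `w % num_banks`, and that thread `t` accesses word `t * stride_words` with a stride of at least 1, so no two threads share a word. The function name is made up for illustration, and the bank count and group size depend on your hardware (for example 16 banks per half-warp of 16 threads on older devices, 32 and 32 on newer ones).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Worst-case number of threads that land on the same shared memory bank
// when thread t accesses 32-bit word index t * stride_words.
// A result of 1 means conflict-free; a result of n means an n-way conflict.
// Assumes stride_words >= 1 and num_banks >= 1.
int bank_conflict_degree(int stride_words, int num_banks, int num_threads) {
    std::vector<int> hits(static_cast<std::size_t>(num_banks), 0);
    for (int t = 0; t < num_threads; ++t) {
        const long long word = static_cast<long long>(t) * stride_words;
        ++hits[static_cast<std::size_t>(word % num_banks)];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

With 16 banks and 16 threads, a stride of 1 is conflict-free, a stride of 2 gives a 2-way conflict, and a stride of 16 gives a 16-way conflict. A stride of 17 is conflict-free again, which is why padding a 16-wide shared array by one element per row is a common fix.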

If it comes down to it, comment out some code (global/shared reads or writes) and see what effect it has in the profiler. If your counts drop, you know where your problem is. It’s not really efficient, but it’ll work until you get to the point where you can spot the issues by eye.

Just make sure that your kernel still writes a result that depends on the code you left in. Otherwise the compiler may treat that code as dead and remove it, leaving just an empty kernel.