What should I optimize first? Divergence? Serialized Warps?

Profiling my kernel with the Visual Profiler gives:
branch = 155778364
divergent branch = 501043
instructions = 721023135
warp serialized = 214170934

The Visual Profiler doesn’t let me check whether I have uncoalesced loads or stores (for some reason those options are greyed out), but according to the command-line profiler I don’t have any uncoalesced memory accesses, so I’m not worrying about those at the moment.
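(For anyone curious how I checked from the command line: the profiler is enabled through environment variables and reads the counters to collect from a config file. The signal names below are from my reading of the CUDA_Profiler doc, and if I remember right only about four signals can be collected per run, so double-check against your toolkit’s copy:)

    CUDA_PROFILE=1                        (environment variables; "set" on Windows)
    CUDA_PROFILE_CONFIG=./profile_config
    CUDA_PROFILE_LOG=./profile.log

    contents of profile_config, one signal per line:
        gld_incoherent
        gst_incoherent
        divergent_branch
        warp_serialize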

My first question is: what does “branch” represent? The explanation says “Number of branch events taken by threads,” and I interpret this as the number of “if,” “else,” and “while” statements that the threads of a warp take. Does a larger or smaller “branch” count affect performance if there is no divergence? My guess is that it doesn’t.
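To make sure I understand, here is the kind of branch I mean (toy kernel, names mine): every thread of a warp evaluates the condition the same way, so I would expect “branch” to go up without any divergence:

    __global__ void uniformBranch(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n)
            return;

        // The condition depends only on blockIdx.x, so all threads of a
        // warp take the same path: "branch" is incremented, but there is
        // no divergence and no serialization of the two paths.
        if (blockIdx.x % 2 == 0)
            out[i] = in[i] * 2.0f;
        else
            out[i] = in[i] + 1.0f;
    }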

The percentage of divergent branches is only 0.3%, so I am thinking that shouldn’t be the bottleneck of my kernel right now (and I understand what this number indicates).
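And conversely, this is what I understand a divergent branch to be (again a toy sketch): adjacent lanes of the same warp take different paths, so the hardware executes both paths one after the other:

    __global__ void divergentBranch(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n)
            return;

        // threadIdx.x % 2 alternates between neighbouring lanes of a
        // warp, so both paths execute serially: this is what the
        // "divergent branch" counter picks up.
        if (threadIdx.x % 2 == 0)
            out[i] = in[i] * 2.0f;
        else
            out[i] = in[i] + 1.0f;
    }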

I am thinking that the first thing I should try to fix is the serialized warps. My understanding is that they are caused by bank conflicts in shared memory (I don’t use constant or texture memory), and I believe that if I could reduce them I would get a substantial speedup.

However, when I use cutilBankChecker from the cutil library to check the bank conflicts, I see that the maximum I get is 5-way bank conflicts, and I would say the conflicts are 4-way on average. Should I spend time trying to reduce the warp serialization? (I have read in a rather old post that cutilBankChecker is not that reliable.) The Programming Guide says “Shared memory accesses, on the other hand, are usually worth optimizing only in case they have a high degree of bank conflicts.” without giving a clue as to how big “high” is.
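(For reference, the only fix I know for this kind of conflict is the padded tile from the SDK transpose example. A rough sketch with names mine, assuming a square matrix whose side is a multiple of TILE and TILE x TILE thread blocks:)

    #define TILE 16

    __global__ void transposeShared(float *out, const float *in, int width)
    {
        // The extra +1 column staggers each row across the 16 banks, so
        // the column read tile[threadIdx.x][threadIdx.y] below hits 16
        // different banks instead of producing a 16-way conflict.
        __shared__ float tile[TILE][TILE + 1];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
        __syncthreads();

        x = blockIdx.y * TILE + threadIdx.x;
        y = blockIdx.x * TILE + threadIdx.y;
        out[y * width + x] = tile[threadIdx.x][threadIdx.y];
    }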

Wait for CUDA 2.2 (to be released next week, probably); it includes new counters for the memory transactions on G200 chips.

Is your code heavy in global memory reads/writes? How sure are you that what you have is coalesced? These are issues you definitely want to address before optimizing the shared memory bank conflicts. In kernels fully limited by global memory transactions, shared memory bank conflicts really don’t slow performance much.
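For the archive, the difference I mean, as a minimal sketch (names are mine):

    // Consecutive threads touch consecutive 4-byte words, so each
    // half-warp's loads and stores coalesce into a single transaction.
    __global__ void copyCoalesced(float *dst, const float *src, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            dst[i] = src[i];
    }

    // A stride between neighbouring threads breaks coalescing: the
    // half-warp's accesses split into many separate transactions.
    __global__ void copyStrided(float *dst, const float *src, int n, int stride)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n)
            dst[i] = src[i];
    }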

Warp serialization MIGHT be quite different from “serialized access to shared memory”. When it comes to shared memory, access still happens only at the half-warp level.

On the contrary, I think “warp serializing” refers to inter-warp serializing, and “divergent branches” refers to intra-warp serializing.

You have understood Divergent branches correctly.

OTOH, I THINK “warp serializing” actually refers to __syncthreads(). Do you have a lot of barriers in your code (especially inside a huge for loop)? If so, multiply that number by the for-loop iteration count and see if it comes close to what the profiler says…
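For example, a hypothetical kernel like this (names and sizes invented) issues two barriers per loop iteration, so the back-of-the-envelope number to compare against the profiler would be 2 * iterations per block:

    __global__ void relaxKernel(float *data, int iterations)
    {
        __shared__ float buf[256];        // assumes blockDim.x == 256
        int base = blockIdx.x * 256;

        buf[threadIdx.x] = data[base + threadIdx.x];
        __syncthreads();

        for (int it = 0; it < iterations; ++it) {
            float v = 0.5f * (buf[threadIdx.x] + buf[(threadIdx.x + 1) & 255]);
            __syncthreads();              // all reads finish before any write
            buf[threadIdx.x] = v;
            __syncthreads();              // all writes finish before next read
        }

        data[base + threadIdx.x] = buf[threadIdx.x];
    }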

Also, note that leaving the profiler enabled will cause your GPU to run at lower speeds. It would be a nice idea to disable it when you are finally ready with your final code. Just my 2 cents.

The documentation is always the first place to go…

From the CUDA_Profiler documentation:

    warp_serialize
    --------------
    This option records the number of thread warps that serialize on
    address conflicts to either shared or constant memory.

and thus clearly states the conditions that result in a warp_serialize count.
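In other words, something like this hypothetical kernel should drive the counter up: a half-warp reading one common constant address is served as a single broadcast, while per-thread addresses have to be served one at a time:

    __constant__ float lut[256];

    __global__ void lutKernel(float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n)
            return;

        // Same address for the whole (half-)warp: one broadcast, fast.
        float uniform = lut[0];

        // Different address per thread: the constant cache serves one
        // address at a time, so the accesses serialize and show up in
        // the warp_serialize counter.
        float scattered = lut[i & 255];

        out[i] = uniform + scattered;
    }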

Aah… Right… I was totally wrong in the other post. Thanks for bringing this up. I am fully convinced that I am really a lazy person (just what my wife says :-) ).