Interpreting profiler output: where to go from here?


My kernel does some fairly complex computation, and I'm wondering where to start looking for possible optimizations. Running it through the Visual Profiler, I get the following result:

That is: GPU time is 99.99%, 23,040 uncoalesced stores and 15,360 uncoalesced loads, 98,292 branches, and 224 million instructions. All of that for basically a single result vector.

From my understanding, the uncoalesced memory accesses shouldn't affect performance much (they are caused by copying input vectors from global to shared memory). Given the instruction count, the kernel looks compute-bound rather than memory-bound, so improving memory access patterns probably shouldn't be the first step. Am I right? How do I put these numbers into perspective with each other?

Hmm, even if a kernel isn’t memory bound, uncoalesced global memory access is typically a huge performance hit.

Every uncoalesced global memory access costs several hundred clock cycles of latency. This latency is hidden to an extent if you have a lot of threads in flight, but it still hurts.
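As a rough mental model, you can count how many memory segments a warp touches for a given access stride; one segment means the accesses coalesce into a single transaction, and each extra segment is an extra serialized transaction. This is a host-side C sketch, not device code, and the 128-byte segment size for 4-byte words is an assumption matching G80-class coalescing rules:

```c
#include <assert.h>

#define WARP_SIZE     32
#define SEGMENT_BYTES 128   /* assumed coalescing segment size for 4-byte words */
#define WORD_BYTES    4

/* Count the aligned memory segments a warp touches when thread t loads the
 * 4-byte word at index base + t * stride.  One segment = one coalesced
 * transaction; each additional segment is an additional transaction. */
static int segments_touched(long base, int stride)
{
    long last_seg = -1;
    int count = 0;
    for (int t = 0; t < WARP_SIZE; ++t) {
        long byte_addr = (base + (long)t * stride) * WORD_BYTES;
        long seg = byte_addr / SEGMENT_BYTES;
        if (seg != last_seg) { ++count; last_seg = seg; }
    }
    return count;
}
```

In these terms, a copy like `shared[threadIdx.x] = input[threadIdx.x]` (unit stride, aligned base) is one transaction per warp, while a misaligned base or a stride of 2 already doubles the transaction count.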

Are the branches you mention divergent branches? Those can also hurt performance. To avoid divergent branches (which end up being serialized), it's important that any if conditions are evaluated at warp granularity: if you have an if statement that depends on the thread ID, the condition should direct different threads onto different execution paths only in multiples of the warp size. Sorry if I'm not explaining this in the best way; there is more in the performance section of the programming guide.

This condition is actually pretty hard to enforce in practice, for most real algorithms, some divergent branches are unavoidable.
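To make the warp-granularity point concrete, here is a small host-side C sketch (an illustrative model, not real device code) counting how many passes one warp needs through a two-way branch: one pass if all 32 threads agree, two if the warp splits and the hardware serializes both paths:

```c
#include <assert.h>

#define WARP_SIZE 32

/* How many times one warp executes a two-way branch: 1 if all threads take
 * the same side, 2 if the warp diverges and both paths run serially. */
static int branch_passes(int (*cond)(int), int first_tid)
{
    int taken = 0, fell_through = 0;
    for (int t = 0; t < WARP_SIZE; ++t) {
        if (cond(first_tid + t)) taken = 1;
        else                     fell_through = 1;
    }
    return taken + fell_through;
}

/* Splits every warp between odd and even lanes: always diverges. */
static int odd_even(int tid) { return tid % 2 == 0; }

/* Changes only at warp-size granularity: uniform within any single warp. */
static int per_warp(int tid) { return (tid / WARP_SIZE) % 2 == 0; }
```

The `odd_even` condition forces every warp through both paths, while `per_warp` keeps each warp on a single path even though, across the whole grid, both paths still execute.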

Also, what is the occupancy of your kernel? Check the cubin file to see how many registers you're using, then plug that into the occupancy calculator spreadsheet.
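For a back-of-envelope occupancy check, here is a hedged C sketch assuming G80-era limits (8192 registers and 768 threads, i.e. 24 warps, per multiprocessor); real hardware adds shared-memory and block-count caps that are not modeled here:

```c
#include <assert.h>

#define REGS_PER_SM    8192  /* assumed register file size per multiprocessor */
#define THREADS_PER_SM 768   /* assumed thread limit per multiprocessor */

/* Active warps per multiprocessor, limited by register use and thread count. */
static int active_warps(int regs_per_thread, int threads_per_block)
{
    int by_regs    = REGS_PER_SM / (regs_per_thread * threads_per_block);
    int by_threads = THREADS_PER_SM / threads_per_block;
    int blocks     = by_regs < by_threads ? by_regs : by_threads;
    return blocks * threads_per_block / 32;
}
```

Under these assumed limits, going from 10 to 32 registers per thread with 256-thread blocks cuts the active warps from 24 (full occupancy) down to 8, which is exactly the kind of effect the register count in the cubin lets you estimate.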