question: Thread divergence & kernel performance

Today I was testing the profiler included in release v1.1 of the CUDA SDK on two versions of a simple program: one with thread divergence and bank conflicts, and another optimized to avoid both problems.

My question is about the number of instructions issued by a G80 and its effect on kernel performance. First, let me post my profiler logs:

program with bank conflicts and thread divergence:

timestamp=[ 6153.044 ] method=[ memcopy ] gputime=[ 3.616 ]

timestamp=[ 6209.049 ] method=[ _Z9reductionPfi ] gputime=[ 25.696 ] cputime=[ 41.550 ] occupancy=[ 0.667 ] instructions=[ 4102 ] cta_launched=[ 1 ]

timestamp=[ 6425.177 ] method=[ memcopy ] gputime=[ 3.136 ]

improved version without bank conflicts and thread divergence:

timestamp=[ 6112.588 ] method=[ memcopy ] gputime=[ 3.552 ]

timestamp=[ 6168.574 ] method=[ _Z9reductionPfi ] gputime=[ 8.800 ] cputime=[ 24.012 ] occupancy=[ 0.667 ] instructions=[ 489 ] cta_launched=[ 1 ]

timestamp=[ 6366.394 ] method=[ memcopy ] gputime=[ 3.200 ]

I’m aware that bank conflicts are serialized and result in a slower kernel computation time, and that thread divergence leads to more instructions being issued. However, does thread divergence (and the resulting extra instructions) contribute to a slower kernel execution time if the G80 isn’t running at its peak rate? Any insight would be greatly appreciated.
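For context, here is a sketch of the two kinds of reduction loop under discussion. This is an assumption about what the profiled kernel (`_Z9reductionPfi`, i.e. `reduction(float*, int)`) might look like, based on the classic shared-memory reduction pattern; the actual program may differ.

```cuda
// Divergent version: the modulo test leaves active and inactive threads
// interleaved within each warp, so both branch outcomes occur inside one
// warp; strided shared-memory indexing like this can also hit bank conflicts.
__global__ void reduction_divergent(float *g_data, int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    sdata[tid] = g_data[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0)            // divergent branch within each warp
            sdata[tid] += sdata[tid + s];  // strided shared-memory access
        __syncthreads();
    }
    if (tid == 0) g_data[blockIdx.x] = sdata[0];
}

// Improved version: sequential addressing keeps the active threads
// contiguous, so whole warps drop out together (no intra-warp divergence)
// and consecutive threads hit consecutive banks (no conflicts).
__global__ void reduction_optimized(float *g_data, int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    sdata[tid] = g_data[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)                       // contiguous active threads
            sdata[tid] += sdata[tid + s];  // conflict-free access
        __syncthreads();
    }
    if (tid == 0) g_data[blockIdx.x] = sdata[0];
}
```

The large gap in the `instructions` counter (4102 vs. 489) is consistent with this kind of restructuring, since the divergent loop issues both sides of the branch for every warp at every stride.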

I’m not sure what you mean by the GPU not running at peak rate, but divergence within a warp does affect performance; the extent depends on the code. Essentially, the instructions from both divergent paths have to be executed by all the threads in the warp.
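A minimal illustration of that point (not taken from the original program): when threads of the same warp take different sides of a branch, the warp executes both paths in sequence, with the threads on the inactive path masked off.

```cuda
// Hypothetical example: odd- and even-numbered threads of the same warp
// take different branches, so the warp serially executes BOTH paths,
// roughly doubling the instructions issued for this section of code.
__global__ void divergent_branch(int *out)
{
    int tid = threadIdx.x;
    if (tid % 2 == 0)
        out[tid] = tid * 2;  // executed first, odd threads masked off
    else
        out[tid] = tid + 1;  // executed second, even threads masked off
}
```

If the branch condition were uniform per warp (e.g. based on `threadIdx.x / warpSize`), each warp would execute only one path and no extra instructions would be issued.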


Well, if your problem is I/O bound, the extra instructions might not affect the total time, since the GPU is sitting and waiting for data anyway.

I guess I should have been a little more descriptive about the problem. By “GPU is not running at its peak rate” I meant that the problem is I/O bound (i.e., more arithmetic operations could be done if the device weren’t waiting for data to be loaded). What would be most helpful is to know whether ‘gputime’ includes the time required for memory operations between global device memory and shared memory, or whether those operations are excluded from the ‘gputime’ output. I’m having a hard time separating the slowdown in kernel execution time caused by bank conflicts from the slowdown caused by thread divergence.

Any help is appreciated.

gputime includes everything from the start of the kernel launch to the time the kernel finishes. So yes, it will include all the time spent on memory transfers between global and shared memory.

I would like to thank everyone for their replies, they have answered most of my questions about the profiler.