question: Thread divergence & kernel performance

Today I was testing the profiler included in release v1.1 of the CUDA SDK on two versions of a simple program: one has thread divergence and bank conflicts, and the other is optimized to avoid both problems.
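To make the comparison concrete, the two kernels are roughly in the spirit of the interleaved- vs. sequential-addressing variants from the SDK reduction walkthrough. The sketch below is only illustrative (kernel names and details are mine, not the exact code I profiled):

// Sketch of a block-level reduction with interleaved addressing. The strided
// shared-memory access pattern causes bank conflicts, and as the stride grows
// only part of each warp stays active, so warps also diverge.
__global__ void reduction_slow(float *g_data, int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;

    sdata[tid] = (tid < (unsigned int)n) ? g_data[tid] : 0.0f;   // load into shared memory
    __syncthreads();

    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        unsigned int index = 2 * s * tid;
        if (index < blockDim.x)
            sdata[index] += sdata[index + s];   // stride-2*s access: bank conflicts
        __syncthreads();
    }

    if (tid == 0) g_data[0] = sdata[0];         // thread 0 writes the block result
}

// Same reduction with sequential addressing: the active threads are always a
// contiguous block, so warps stay converged until the very last steps, and the
// unit-stride shared-memory accesses are conflict-free.
__global__ void reduction_fast(float *g_data, int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;

    sdata[tid] = (tid < (unsigned int)n) ? g_data[tid] : 0.0f;
    __syncthreads();

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) g_data[0] = sdata[0];
}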

My question is about the number of instructions issued by a G80 and its effect on kernel performance. First, let me post my profiler logs:

program with bank conflicts and thread divergence:

timestamp=[ 6153.044 ] method=[ memcopy ] gputime=[ 3.616 ]

timestamp=[ 6209.049 ] method=[ _Z9reductionPfi ] gputime=[ 25.696 ] cputime=[ 41.550 ] occupancy=[ 0.667 ] instructions=[ 4102 ] cta_launched=[ 1 ]

timestamp=[ 6425.177 ] method=[ memcopy ] gputime=[ 3.136 ]

improved version without bank conflicts and thread divergence:

timestamp=[ 6112.588 ] method=[ memcopy ] gputime=[ 3.552 ]

timestamp=[ 6168.574 ] method=[ _Z9reductionPfi ] gputime=[ 8.800 ] cputime=[ 24.012 ] occupancy=[ 0.667 ] instructions=[ 489 ] cta_launched=[ 1 ]

timestamp=[ 6366.394 ] method=[ memcopy ] gputime=[ 3.200 ]

I’m aware that bank conflicts are serialized and result in a slower kernel computation time, and that thread divergence leads to more instructions being issued. However, does thread divergence (and, as a result, the larger instruction count) contribute to a slower kernel execution time if the G80 isn’t running at its peak rate? Any insight would be greatly appreciated.

I’m not sure what you mean by the GPU not running at its peak rate, but divergence within a warp does affect performance; the extent depends on the code. Essentially, the instructions from both divergent paths have to be executed by all the threads in the warp.
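To illustrate with a toy kernel (just an illustration, not anyone’s real code): every warp below contains both even and odd threads, so the hardware issues the instructions for both branches and masks off the threads that are not on the taken path.

__global__ void divergent_example(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Even and odd threads sit in the same warp, so the warp executes
    // both sides of this branch one after the other, with inactive
    // threads masked off on each side.
    if (tid % 2 == 0)
        out[tid] = 1.0f;
    else
        out[tid] = 2.0f;
}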

Paulius

Well, if your problem is I/O bound, the extra instructions might not affect the total time, since the GPU is sitting and waiting for data anyway.

I guess I should have been a little more descriptive about the problem. What I meant by “the GPU is not running at its peak rate” is that the problem is I/O bound (i.e., more arithmetic operations could be done if the device weren’t waiting for data to be loaded). What would be most helpful is to know whether ‘gputime’ includes the time required for memory operations between global device memory and shared memory, or whether those operations are excluded from the ‘gputime’ output. I’m having a hard time determining how much of the slowdown in kernel execution time comes from bank conflicts versus how much comes from thread divergence.

Any help is appreciated.

gputime includes all the time from the beginning of the kernel launch to the time the kernel finishes. So yes, it includes the time spent on memory operations between global and shared memory.
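If you want to double-check the profiler, you can time the launch yourself with CUDA events, which measure the same interval. A minimal sketch (the kernel declaration, launch configuration, and the d_data pointer are placeholders, not your actual setup):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void reduction(float *g_data, int n);   // kernel under test, defined elsewhere

void time_reduction(float *d_data, int n)
{
    cudaEvent_t start, stop;
    float elapsed_ms = 0.0f;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Record around the launch; the elapsed time covers everything the kernel
    // does, including its traffic between global and shared memory.
    cudaEventRecord(start, 0);
    reduction<<<1, 256, 256 * sizeof(float)>>>(d_data, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);                    // wait for the kernel to finish

    cudaEventElapsedTime(&elapsed_ms, start, stop);   // milliseconds
    printf("kernel time: %f ms\n", elapsed_ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}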

I would like to thank everyone for their replies, they have answered most of my questions about the profiler.