Is there any way to avoid warp divergence caused by a switch statement?

cheer37 · October 9, 2014, 3:02pm

One of my device functions, which is called many times, contains a switch statement with 18 cases.
As a result, I see a large latency due to warp divergence in that switch.
Is there any way to avoid the warp divergence caused by the switch?
Thanks in advance.

cbuchner1 · October 9, 2014, 3:15pm

Dump your thread states to an array, sort the threads by switch condition, recover the thread states in sorted order, and continue execution. ;)
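The recipe above can be sketched as two kernels with a Thrust sort in between. This is a minimal sketch under stated assumptions, not code from the thread: `ThreadState`, `caseOf`, `applyCase`, `dumpState`, `resumeSorted` and `runSorted` are hypothetical names, the switch has 4 cases instead of 18, and the only "state" carried across the switch is one `unsigned` per item. Only the small (key, index) pairs are sorted with `thrust::sort_by_key`; the second kernel reads each dumped state through the sorted index, so consecutive threads, and therefore most warps, take the same case.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <vector>

// Hypothetical state a thread carries across the switch; a real kernel
// would store whatever values it still needs after the branch.
struct ThreadState {
    unsigned value;
};

// Hypothetical switch condition: 4 cases here instead of the 18 in the question.
__host__ __device__ int caseOf(unsigned v) { return static_cast<int>(v % 4u); }

// Hypothetical body of the original switch, shared by device code and host checks.
__host__ __device__ unsigned applyCase(int c, unsigned v) {
    switch (c) {
        case 0:  return v + 1u;
        case 1:  return v * 2u;
        case 2:  return v + 7u;
        default: return v * v;
    }
}

// Kernel 1: do the pre-switch work, then dump the state and the switch key.
__global__ void dumpState(const unsigned* in, ThreadState* states, int* keys, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    states[i].value = in[i];
    keys[i] = caseOf(in[i]);
}

// Kernel 2: thread i resumes the i-th item in key order, so neighbouring
// lanes (and therefore most warps) execute the same case of the switch.
__global__ void resumeSorted(const ThreadState* states, const int* sortedKeys,
                             const int* order, unsigned* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int item = order[i];  // original index of this item
    out[item] = applyCase(sortedKeys[i], states[item].value);
}

std::vector<unsigned> runSorted(const std::vector<unsigned>& hostIn) {
    int n = static_cast<int>(hostIn.size());
    if (n == 0) return {};
    thrust::device_vector<unsigned> in(hostIn.begin(), hostIn.end());
    thrust::device_vector<ThreadState> states(n);
    thrust::device_vector<int> keys(n), order(n);
    thrust::device_vector<unsigned> out(n);

    const int threads = 256, blocks = (n + threads - 1) / threads;
    dumpState<<<blocks, threads>>>(thrust::raw_pointer_cast(in.data()),
                                   thrust::raw_pointer_cast(states.data()),
                                   thrust::raw_pointer_cast(keys.data()), n);
    assert(cudaGetLastError() == cudaSuccess);  // sketch-level launch check

    // Sort only the small (key, index) pairs; the dumped states stay in place.
    thrust::sequence(order.begin(), order.end());
    thrust::sort_by_key(keys.begin(), keys.end(), order.begin());

    resumeSorted<<<blocks, threads>>>(thrust::raw_pointer_cast(states.data()),
                                      thrust::raw_pointer_cast(keys.data()),
                                      thrust::raw_pointer_cast(order.data()),
                                      thrust::raw_pointer_cast(out.data()), n);
    assert(cudaGetLastError() == cudaSuccess);

    std::vector<unsigned> result(n);
    thrust::copy(out.begin(), out.end(), result.begin());  // waits for the kernel
    return result;
}
```

The price is the sort itself plus a gathered (uncoalesced) read of the dumped states in the second kernel, so this only pays off when the case bodies are expensive compared with that overhead.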

cheer37 · October 9, 2014, 3:38pm

How can I implement it?
Could you provide a small code sample?
Thanks in advance.

MutantJohn · October 9, 2014, 5:21pm

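Because CUDA executes each warp in SIMT fashion (Single Instruction, Multiple Threads: the 32 threads of a warp share one instruction stream), what cbuchner1 means is: sort your threads by the switch condition, and then more of your warps will have all of their threads taking the same case.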

So instead of your threads' switch conditions looking like this: 0 1 3 2 1 0 1 1 2 3 3 1 0

you should sort them so they look like this: 0 0 0 1 1 1 1 1 2 2 3 3 3

This way you'll get greater warp efficiency. But keep in mind that this also adds the overhead of sorting, so it's a trade-off.
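One way to check how much sorting helps is to count the distinct cases each 32-thread warp would hit, since each distinct case is roughly one serialized pass through the switch. The sketch below is an illustration, not code from the thread; `casesPerWarp` and `countCasesPerWarp` are hypothetical names, and it assumes case ids lie in [0, 32) so that one 32-bit mask per warp can record them (18 cases fit).

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

constexpr int kWarpSize = 32;

// One thread per 32-element chunk counts how many distinct switch cases the
// corresponding warp would execute. Assumes every key is in [0, 32).
__global__ void casesPerWarp(const int* keys, int* counts, int n) {
    int w = blockIdx.x * blockDim.x + threadIdx.x;
    int begin = w * kWarpSize;
    if (begin >= n) return;
    int end = min(begin + kWarpSize, n);
    unsigned seen = 0u;
    for (int i = begin; i < end; ++i) seen |= 1u << keys[i];  // mark case as seen
    counts[w] = __popc(seen);                                  // number of distinct cases
}

std::vector<int> countCasesPerWarp(const std::vector<int>& keys) {
    int n = static_cast<int>(keys.size());
    int warps = (n + kWarpSize - 1) / kWarpSize;
    if (warps == 0) return {};

    int *dKeys = nullptr, *dCounts = nullptr;
    cudaMalloc(&dKeys, n * sizeof(int));
    cudaMalloc(&dCounts, warps * sizeof(int));
    cudaMemcpy(dKeys, keys.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    const int threads = 128;
    casesPerWarp<<<(warps + threads - 1) / threads, threads>>>(dKeys, dCounts, n);
    assert(cudaGetLastError() == cudaSuccess);  // sketch-level launch check

    std::vector<int> counts(warps);
    cudaMemcpy(counts.data(), dCounts, warps * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dKeys);
    cudaFree(dCounts);
    return counts;
}
```

For 64 items whose keys cycle 0 1 2 3, every warp sees 4 cases; with the same keys sorted, each warp sees 2. With 18 cases mixed at random, an unsorted warp can hit up to 18.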

cheer37 · October 10, 2014, 4:43am

Yes, I already understand it in theory, but to implement it:
First, I have to synchronize all threads in the grid, not just in a block, efficiently. How can I do that? This might also hurt performance.
Second, I have to change the warp configuration according to the switch-case order. How can I change the warp configuration?
I don't know how to implement these two things in code, so I need a small code sample. I have never seen this done before.
Thanks.

MutantJohn · October 10, 2014, 4:49am

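I think for a global sync you need to use separate kernels launched in the same stream: kernels in one stream execute in order, so the second kernel does not start until every thread of the first has finished.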
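If a grid-wide order is not required, a narrower variant of the same idea avoids global synchronization entirely: sort the case keys only within each thread block, which needs nothing beyond the block-level barriers the sort performs internally. The sketch below uses CUB's `cub::BlockRadixSort`, which is a substitution on our part and was not proposed in the thread; `caseOf`, `applyCase`, `blockSortedSwitch`, `runBlockSorted`, the 4-case switch and the block size of 128 are all assumptions.

```cuda
#include <cassert>
#include <cub/block/block_radix_sort.cuh>
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlockThreads = 128;  // assumed block size; launch must match it
constexpr int kNumCases = 4;        // hypothetical: 4 cases instead of 18

__host__ __device__ int caseOf(unsigned v) { return static_cast<int>(v % kNumCases); }

// Hypothetical body of the original switch.
__host__ __device__ unsigned applyCase(int c, unsigned v) {
    switch (c) {
        case 0:  return v + 1u;
        case 1:  return v * 2u;
        case 2:  return v + 7u;
        default: return v * v;
    }
}

__global__ void blockSortedSwitch(const unsigned* in, unsigned* out, int n) {
    using BlockSort = cub::BlockRadixSort<int, kBlockThreads, 1, int>;
    __shared__ typename BlockSort::TempStorage temp;

    int i = blockIdx.x * kBlockThreads + threadIdx.x;
    // Out-of-range threads get a sentinel key larger than any real case, so
    // they sort to the end. Every thread must take part in the sort.
    int key[1]  = { i < n ? caseOf(in[i]) : kNumCases };
    int item[1] = { i };
    BlockSort(temp).Sort(key, item);  // synchronizes the block internally

    // Thread t now holds the t-th smallest key of this block, so the warps
    // inside the block mostly see a single case.
    if (item[0] < n) out[item[0]] = applyCase(key[0], in[item[0]]);
}

std::vector<unsigned> runBlockSorted(const std::vector<unsigned>& hostIn) {
    int n = static_cast<int>(hostIn.size());
    if (n == 0) return {};
    unsigned *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(unsigned));
    cudaMalloc(&dOut, n * sizeof(unsigned));
    cudaMemcpy(dIn, hostIn.data(), n * sizeof(unsigned), cudaMemcpyHostToDevice);

    int blocks = (n + kBlockThreads - 1) / kBlockThreads;
    blockSortedSwitch<<<blocks, kBlockThreads>>>(dIn, dOut, n);
    assert(cudaGetLastError() == cudaSuccess);  // sketch-level launch check

    std::vector<unsigned> result(n);
    cudaMemcpy(result.data(), dOut, n * sizeof(unsigned), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

This only helps if each case has enough items inside a block to fill whole warps; with 18 cases and 128 threads per block, some warps will still mix cases, so it trades some divergence for not needing a second kernel.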

What do you mean by warp configuration? Aren't warps always 32 threads?

cheer37 · October 10, 2014, 7:52am

I meant the arrangement of threads within a warp.
To avoid thread divergence, as you said, I should arrange the threads in each warp sorted by case (all the 0s, then all the 1s, and so on).

cheer37 · October 10, 2014, 8:09am

Oh, I get it: you both mean I need to split the kernel into two parts and rearrange only the status array (the switch-case keys).
Thanks very much, cbuchner1 and MutantJohn :)
