Is there any way to avoid warp divergence caused by a switch statement?

One of my device functions, which is called many times, contains a switch statement with 18 cases.
So I see high latency due to the warp divergence caused by that switch.
Is there any way to avoid the warp divergence caused by the switch?
Thanks in advance.

Dump your thread states to an array, sort the threads by switch condition, recover thread states in sorted order, continue execution. ;)

How can I implement it?
Could you provide a small code sample?
Thanks in advance.

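Because CUDA follows the Single Instruction, Multiple Threads (SIMT) paradigm, all threads of a warp execute the same instruction at a time. What cbuchner1 means is: sort your threads by the switch condition, and then more of your warps will have all of their threads taking the same case.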

So instead of having your threads with switch conditions looking like this: 0 1 3 2 1 0 1 1 2 3 3 1 0

You should sort them and then have: 0 0 0 1 1 1 1 1 2 2 3 3 3

This way you’ll get greater warp efficiency. But keep in mind, this also means you have the overhead of sorting so it’s a trade-off.
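To make the sorting step concrete, here is a minimal sketch using Thrust, which ships with the CUDA toolkit. It assumes you already have one case id per work item; the function name order_by_case and the case_ids array are made up for illustration. It sorts item indices by case id, so the result tells you which item each thread should handle. Error checking is omitted.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <vector>

// Hypothetical helper: returns the item indices reordered so that items
// with the same switch case sit next to each other.
std::vector<int> order_by_case(const std::vector<int>& case_ids)
{
    // Keys: the case id of every item, copied to the GPU.
    thrust::device_vector<int> keys(case_ids.begin(), case_ids.end());

    // Values: the original item indices 0, 1, 2, ...
    thrust::device_vector<int> items(keys.size());
    thrust::sequence(items.begin(), items.end());

    // Sort the indices by case id. The stable variant keeps items of the
    // same case in their original relative order.
    thrust::stable_sort_by_key(keys.begin(), keys.end(), items.begin());

    // Copy the sorted indices back to the host.
    std::vector<int> order(items.size());
    thrust::copy(items.begin(), items.end(), order.begin());
    return order;
}
```

For the 13 case ids in the example above, this should return 0 5 12 1 4 6 7 11 3 8 2 9 10: first the three items with case 0, then the five with case 1, and so on. In a real program you would probably keep the sorted indices on the device instead of copying them back.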

Yes, I already understand it in theory, but to actually do it:
First, I have to synchronize all threads in the grid, not just in a block. How can I do that efficiently? It might also cost performance.
Second, I have to change the warp configuration according to the switch case order. How can I change the warp configuration?
I don't know how to implement these two things in code, so I need a small code sample. I have never seen this done before.
Thanks.

I think for a global sync, you need to use separate kernels in the same stream.
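A rough sketch of what that looks like, under some assumptions: the kernel names classify and consume, the helper run_two_phases, and the "value % 4" rule are all invented for illustration, and the input values are assumed non-negative. Kernels launched into the same stream run in launch order, so the second kernel only starts after every block of the first has finished. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Phase 1: each thread records which switch case its item needs.
// (value % 4) is a made-up stand-in for the real condition.
__global__ void classify(const int* in, int* case_id, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) case_id[i] = in[i] % 4;
}

// Phase 2: reads a case id written by a different thread, possibly in a
// different block. That is only safe because phase 1 has fully finished.
__global__ void consume(const int* case_id, int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = case_id[n - 1 - i];
}

std::vector<int> run_two_phases(const std::vector<int>& h_in)
{
    int n = static_cast<int>(h_in.size());
    std::vector<int> h_out(n);
    if (n == 0) return h_out;  // a launch with zero blocks is an error

    int *in, *case_id, *out;
    cudaMalloc(&in, n * sizeof(int));
    cudaMalloc(&case_id, n * sizeof(int));
    cudaMalloc(&out, n * sizeof(int));

    cudaStream_t s;
    cudaStreamCreate(&s);
    cudaMemcpyAsync(in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice, s);

    int threads = 128, blocks = (n + threads - 1) / threads;
    classify<<<blocks, threads, 0, s>>>(in, case_id, n);
    // Same stream: this launch waits for classify to complete on the whole grid.
    consume<<<blocks, threads, 0, s>>>(case_id, out, n);

    cudaMemcpyAsync(h_out.data(), out, n * sizeof(int), cudaMemcpyDeviceToHost, s);
    cudaStreamSynchronize(s);  // wait until the copy back has finished

    cudaFree(in); cudaFree(case_id); cudaFree(out);
    cudaStreamDestroy(s);
    return h_out;
}
```

The cost is one extra kernel launch plus writing the intermediate state to global memory and reading it back, which is the overhead that has to be weighed against the divergence you remove.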

What’s warp configuration? Are warps not always 32 threads?

I meant the arrangement of threads within a warp.
To avoid thread divergence, as you said, I should arrange the threads in each warp in sorted order, e.g. 0 0 0 1 1 1 1 1 2 2 3 3 3.
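You can't choose which threads end up in a warp; the hardware groups consecutive thread indices. What you can rearrange is which item each thread works on. Here is a minimal sketch, assuming an index array that is already sorted by case. The names apply_case, process_sorted and run_sorted are hypothetical, and only 4 cases are shown in place of the 18. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for the device function with the big switch.
__device__ float apply_case(int c, float x)
{
    switch (c) {
        case 0:  return x + 1.0f;
        case 1:  return x * 2.0f;
        case 2:  return x - 3.0f;
        default: return -x;
    }
}

// Thread t handles item order[t] instead of item t. If `order` is sorted
// by case, neighbouring threads in a warp mostly take the same branch.
__global__ void process_sorted(const int* order, const int* case_id,
                               const float* in, float* out, int n)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n) return;
    int item = order[t];
    // The result is written to the item's original slot.
    out[item] = apply_case(case_id[item], in[item]);
}

std::vector<float> run_sorted(const std::vector<int>& order,
                              const std::vector<int>& case_id,
                              const std::vector<float>& in)
{
    int n = static_cast<int>(in.size());
    std::vector<float> h_out(n);
    if (n == 0) return h_out;

    int *d_order, *d_case;
    float *d_in, *d_out;
    cudaMalloc(&d_order, n * sizeof(int));
    cudaMalloc(&d_case, n * sizeof(int));
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_order, order.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_case, case_id.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_in, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    process_sorted<<<(n + 127) / 128, 128>>>(d_order, d_case, d_in, d_out, n);

    // cudaMemcpy waits for the kernel before copying the results back.
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_order); cudaFree(d_case); cudaFree(d_in); cudaFree(d_out);
    return h_out;
}
```

One thing to watch: because threads now read in[item] and write out[item] through the index array, neighbouring threads no longer touch neighbouring addresses, so memory accesses may coalesce less well. Whether the saved divergence outweighs that is something you would have to measure.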

Oh, I get it. Both of you meant that I need to split the kernel into two parts and rearrange only the status array (the case status of each thread).
Thanks very much, cbuchner1 and mutantJohn:)