Does a switch statement on thread ID cause divergence?

LHickey · January 6, 2011, 8:40pm

If each thread ID needs a different formula, will this cause divergence?
switch(threadIdx.x)
{
case(0): formula0();break;
case(1): formula1();break;
case(2): formula2();break;
}

That is, will thread 0, thread 1, and thread 2 execute concurrently?
Does the answer differ if the switch is on a function of only the thread ID, that is, switch (F(threadIdx.x)) { … }?

I understand that if the problem were written like this, it would cause divergence, and the warp would stall rather than run the formulas concurrently:
if (threadIdx.x == 0) formula0();
else if (threadIdx.x == 1) formula1();
else if (threadIdx.x == 2) formula2();

Remember what Kermit said: it ain’t easy being green. So have mercy on the CUDA green/newbie.

Suppose the answer is that, within a warp, the switch is no better than if / else if, and I have to deal with warp divergence.
I understand that divergence occurs within a warp. What if I use mod 32 on the thread ID: can I get threads in different warps to do different formulas, and will those warps execute their different formulas concurrently?

SPWorley · January 6, 2011, 11:32pm

Yes, a switch statement will diverge threads exactly like the equivalent if() statement would.

You’re right that there’s no divergence if all the threads in a warp follow the same branch, though. This is very common if you’re switching on thread ID.
Switching on mod 32 of thread ID is exactly the wrong choice, though… that’s the worst case, since every thread in a warp gets a different value. You’ll get no divergence if you can keep bundles of 32 threads together, so you’d switch on threadIdx.x/32. I often name this “wid” in my own code, short for warp ID, similar to using “tid” for thread ID.
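
A minimal sketch of the point above, written as plain host-side C++ so it can be run without a GPU. It assumes a 1D block and a warp size of 32, and it only models which case label each thread would select; the helper names (maxBranchesPerWarp, byLane, byWarp) are hypothetical and do not come from the thread or from CUDA.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

constexpr int kWarpSize = 32;  // assumed warp size

// Hypothetical helper: the largest number of distinct switch values seen by
// any single warp of a 1D block. A result of 1 means every warp follows a
// single branch, i.e. no divergence from this switch.
template <typename Selector>
std::size_t maxBranchesPerWarp(int blockSize, Selector selector) {
    assert(blockSize > 0);
    std::size_t worst = 0;
    for (int warpStart = 0; warpStart < blockSize; warpStart += kWarpSize) {
        std::set<int> branches;  // distinct case labels taken within this warp
        for (int tid = warpStart;
             tid < warpStart + kWarpSize && tid < blockSize; ++tid) {
            branches.insert(selector(tid));
        }
        if (branches.size() > worst) worst = branches.size();
    }
    return worst;
}

// Switch on the position inside the warp (tid % 32): the worst case.
inline int byLane(int tid) { return tid % kWarpSize; }

// Switch on the warp ID (tid / 32), the "wid" mentioned above.
inline int byWarp(int tid) { return tid / kWarpSize; }
```

Calling maxBranchesPerWarp(320, byLane) and maxBranchesPerWarp(320, byWarp) compares the two choices for a block of 10 warps: the first selector spreads each warp over many cases, the second keeps each warp on one.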

LHickey · January 7, 2011, 4:21pm

Thanks very much. Does the C trigraph get a dispensation from causing divergence, i.e. (a<b)?a:b within a warp? I need to implement min, max, etc. If this idiom is trouble, is there an alternative, or do I just need to endure the divergence at this statement?

SPWorley · January 7, 2011, 5:07pm

That’s not a trigraph, BTW. It’s the conditional operator. But yes, that’s a conditional statement and can cause divergence.

But a few points:

For very small divergences (which are common in code) PTX uses predication, which is still a divergence but it’s very lightweight.
For the specific example of min and max, even if you manually write the conditional above, the compiler can identify it and replace it with the native PTX min and max one-clock opcodes.
You’re worrying too much about removing all divergence. You want to remove cases where threads diverge and stay diverged for thousands of instructions. Don’t sweat a conditional where some threads are masked off for four or five ops.

avidday · January 7, 2011, 5:52pm

There are min and max functions in the CUDA math library which compile to hardware instructions and allow you to bypass this unnecessary hand-wringing about divergence altogether…
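
A small sketch of the two spellings discussed above, again as host-side C++ rather than device code. Here std::min and std::max stand in for the CUDA min() and max() functions; that substitution is an assumption made so the example runs on a CPU, and the function names below are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Hand-written conditional forms, as in the question above.
inline int minConditional(int a, int b) { return (a < b) ? a : b; }
inline int maxConditional(int a, int b) { return (a > b) ? a : b; }

// Library forms; in device code these would be min(a, b) and max(a, b).
inline int minLibrary(int a, int b) { return std::min(a, b); }
inline int maxLibrary(int a, int b) { return std::max(a, b); }

// Hypothetical check: do both spellings agree for every pair in [lo, hi]?
inline bool formsAgree(int lo, int hi) {
    assert(lo <= hi);
    for (int a = lo; a <= hi; ++a) {
        for (int b = lo; b <= hi; ++b) {
            if (minConditional(a, b) != minLibrary(a, b)) return false;
            if (maxConditional(a, b) != maxLibrary(a, b)) return false;
        }
    }
    return true;
}
```

Since both forms give the same values, using the library call costs nothing in correctness and leaves the choice of instruction to the compiler.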

LHickey · January 7, 2011, 10:45pm

This is very useful. I have a C2050 Fermi card, and suppose I want to create a CUDA kernel with 10 warps.
Kernel<<<dG, dB>>>( … ) with dB.x = 32 and dB.y = 10: will this do it, so that when Kernel wakes up it will have threads with threadIdx.x from 0 to 31 and threadIdx.y from 0 to 9, and
then if threadIdx.y == 0, the threads with threadIdx.x = 0-31 are in the same warp,
and if threadIdx.y == 1, the threads with threadIdx.x = 0-31 are in the second warp …
so I can factor the work so that the warps have minimal divergence within? The threads with different stuff to do will then be in different warps.
Have I got the idea right?
Can I use up to 32 (instead of 10), so I can have 32 warps in each thread block?
I am very grateful for help.

In other words: with dim3 dimBlock(16,32), for any fixed threadIdx.x between 0 and 15, do all threads with threadIdx.y between 0 and 31 belong to the same warp?

1) In C, for P[x][y], y varies fastest in the memory layout, so I assumed that memory layout is the key to how the 16*32 threads are grouped into warps.

2) Fortran rules, on the other hand, would suggest that for P[x,y], x varies fastest, so maybe if you want 16 different warps, dimBlock(32,16) is right, and for any fixed threadIdx.y between 0 and 15, all threads with threadIdx.x (between 0 and 31) belong to the same warp.

So Fortran or C rules, then?
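
For reference, the CUDA Programming Guide defines the thread ID within a 2D block of size (Dx, Dy) as x + y*Dx, and warps are formed from consecutive thread IDs, so x varies fastest. The sketch below models that rule in host-side C++ under an assumed warp size of 32; Dim2, linearThreadId and warpId are hypothetical names used only for illustration.

```cpp
#include <cassert>

constexpr int kWarpSize = 32;  // assumed warp size

// Hypothetical stand-in for the x and y fields of CUDA's dim3 / threadIdx.
struct Dim2 {
    int x;
    int y;
};

// Linear thread ID within a 2D block: x varies fastest.
inline int linearThreadId(Dim2 threadIdx, Dim2 blockDim) {
    assert(threadIdx.x >= 0 && threadIdx.x < blockDim.x);
    assert(threadIdx.y >= 0 && threadIdx.y < blockDim.y);
    return threadIdx.x + threadIdx.y * blockDim.x;
}

// Warp that a thread lands in: consecutive linear IDs are grouped in 32s.
inline int warpId(Dim2 threadIdx, Dim2 blockDim) {
    return linearThreadId(threadIdx, blockDim) / kWarpSize;
}
```

With a block of (32,10), warpId equals threadIdx.y, so each row of 32 threads is one warp. With a block of (16,32), a fixed threadIdx.x is spread across warps as threadIdx.y changes, and each warp instead holds two consecutive rows of 16 threads.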
