Does a switch statement on thread ID cause divergence?

If each thread needs a different formula depending on its thread ID, will this cause divergence?
case(0): formula0();break;
case(1): formula1();break;
case(2): formula2();break;

I.e., will thread 0, thread 1, and thread 2 execute concurrently?
Does the answer differ if the switch is a function of only the thread ID, that is, switch (F(threadIdx.x)) { … }?

I understand that if the problem were written like this, it would cause divergence, and the warp would stall and not run the formulas concurrently.
if ( threadIdx.x == 0) formula0();
else if ( threadIdx.x == 1) formula1();
else if ( threadIdx.x == 2) formula2();

Remember what Kermit said: it ain’t easy being green. So have mercy on the CUDA newbie.

Suppose the answer is that, within a warp, the switch is no better than if / else if … and I have to deal with warp divergence.
I understand that divergence is something that occurs within a warp. What if I use mod 32 on the thread ID: can I get threads in different warps to do different formulas, and will those different warps execute their different formulas concurrently?

Yes, a switch statement will diverge threads exactly like the equivalent if() statement would.

You’re right that there’s no divergence if all the threads in a warp follow the same branch, though. This is very common if you’re switching on thread ID.
Switching on thread ID mod 32 is exactly the wrong choice, though; that’s the worst case. You’ll get no divergence if you can keep bundles of 32 threads together, so you’d switch on threadIdx.x/32. I often name this “wid” in my own code, short for warp ID, similar to using “tid” for thread ID.
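
To make the warp ID idea concrete, here is a minimal sketch, assuming a 1D block whose size is a multiple of 32. The formula bodies, the helper names warp_id and lane_id, and the kernel name apply_by_warp are placeholders for illustration, not anything from the posts above.

```cuda
#include <cuda_runtime.h>

constexpr int kWarpSize = 32;  // warp width on current NVIDIA GPUs

// Placeholder formulas standing in for formula0/1/2 from the question.
__host__ __device__ inline float formula0(float x) { return x + 1.0f; }
__host__ __device__ inline float formula1(float x) { return x * 2.0f; }
__host__ __device__ inline float formula2(float x) { return x * x; }

// Warp-uniform selector: all 32 threads of a warp get the same value.
__host__ __device__ inline int warp_id(int tid) { return tid / kWarpSize; }

// Worst-case selector: every thread in a warp gets a different value.
__host__ __device__ inline int lane_id(int tid) { return tid % kWarpSize; }

// Assumes a 1D block whose size is a multiple of 32, e.g. <<<1, 96>>>.
__global__ void apply_by_warp(const float* in, float* out, int n)
{
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    if (i >= n) return;

    // The selector is constant across a warp, so each warp takes one branch.
    switch (warp_id(tid) % 3) {
        case 0:  out[i] = formula0(in[i]); break;
        case 1:  out[i] = formula1(in[i]); break;
        default: out[i] = formula2(in[i]); break;
    }
}
```

Swapping warp_id(tid) for lane_id(tid) in the switch gives the mod 32 version, where neighbouring threads of the same warp land in different cases and the warp has to run each taken branch in turn.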

Thanks very much. Does the C trigraph get a dispensation from causing divergence, i.e. (a<b)?a:b within a warp? I need to implement min, max, etc., and if this idiom is trouble, is there an alternative, or do I just need to endure the divergence at this statement?

That’s not a trigraph, BTW. It’s the conditional operator. But yes, it’s still a conditional and can cause divergence.

But a few points:

  1. For very small divergences (which are common in code) PTX uses predication, which is still a divergence but it’s very lightweight.

  2. For the specific example of min and max, even if you manually write the conditional above, the compiler can identify it and replace it with the native PTX min and max one-clock opcodes.

  3. You’re worrying too much about removing all divergence. You want to remove cases where threads diverge and stay diverged for thousands of instructions. Don’t sweat a conditional where some threads are masked off for four or five ops.

There are min and max functions in the CUDA math library which compile to hardware instructions and let you bypass this unnecessary hand-wringing about divergence altogether.
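
As a small sketch of both styles, the snippet below puts the hand-written conditional next to the library calls fminf and fmaxf. The names min_ternary, clamp_builtin and clamp_kernel are made up for this example; what instructions the compiler actually emits for either form depends on the toolkit and target, so treat the comments as expectations rather than guarantees.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Hand-written form from the question. The compiler is free to turn this
// into a select or min instruction instead of a real branch.
__host__ __device__ inline float min_ternary(float a, float b)
{
    return (a < b) ? a : b;
}

// Same idea using the math library: clamp x into [lo, hi] with no
// explicit conditional in the source at all.
__host__ __device__ inline float clamp_builtin(float x, float lo, float hi)
{
    return fminf(fmaxf(x, lo), hi);
}

// Clamp every element of data in place; one thread per element.
__global__ void clamp_kernel(float* data, int n, float lo, float hi)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = clamp_builtin(data[i], lo, hi);
}
```

For integers, min(a, b) and max(a, b) can be used the same way in device code.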

This is very useful. I have a C2050 Fermi card, and suppose I want to create a CUDA kernel with 10 warps.
Kernel<<<dim3 dG, dim3 dB>>>( … ) with dB.x = 32 and dB.y = 10: will this do it, so that when the kernel starts it has threads with threadIdx.x from 0 to 31 and threadIdx.y from 0 to 9, and
then for threadIdx.y == 0, the threads with threadIdx.x = 0 to 31 are in the same warp,
and for threadIdx.y == 1, the threads with threadIdx.x = 0 to 31 are in the second warp, and so on?
That way I can factor the work so that each warp has minimal divergence within it, and threads with different work to do are in different warps.
Have I got the idea right?
Can I use up to 32 (instead of 10), so that I have 32 warps in each thread block?
I am very grateful for help.

In other words, with dim3 dimBlock(16,32), for any fixed threadIdx.x between 0 and 15, do all threads with threadIdx.y between 0 and 31 belong to the same warp?

1) In C, for P[x][y], y varies fastest in the memory layout, so I assumed that memory layout is the key to how the 16*32 threads are grouped into warps.

2) Fortran rules, on the other hand, would suggest that for P[x,y], x varies fastest, so maybe if you want 16 different warps, dimBlock(32,16) is right, and for any fixed threadIdx.y between 0 and 15, all threads with threadIdx.x between 0 and 31 belong to the same warp.

So Fortran or C rules, then?
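
For what it's worth, the CUDA programming guide defines the thread ID within a block as x + y*Dx + z*Dx*Dy, and warps are made of consecutive thread IDs, so x varies fastest (the "Fortran" ordering). The sketch below just spells that arithmetic out; linear_tid, warp_of and record_warp_ids are hypothetical names, and the warp size of 32 is hard-coded as an assumption.

```cuda
#include <cuda_runtime.h>

// Linear thread ID within a block: x varies fastest, then y, then z.
__host__ __device__ inline int linear_tid(int tx, int ty, int tz,
                                          int dimx, int dimy)
{
    return tx + ty * dimx + tz * dimx * dimy;
}

// Warp index within the block, assuming 32 threads per warp.
__host__ __device__ inline int warp_of(int tx, int ty, int tz,
                                       int dimx, int dimy)
{
    return linear_tid(tx, ty, tz, dimx, dimy) / 32;
}

// Each thread records which warp of its block it belongs to.
// out must hold blockDim.x * blockDim.y * blockDim.z ints; launch one block.
__global__ void record_warp_ids(int* out)
{
    int tid = linear_tid(threadIdx.x, threadIdx.y, threadIdx.z,
                         blockDim.x, blockDim.y);
    out[tid] = tid / 32;
}
```

With dimBlock(32,16), each fixed threadIdx.y is exactly one warp, giving 16 warps. With dimBlock(16,32), a warp covers two consecutive threadIdx.y rows of 16 threads each, so threads with the same threadIdx.x but different threadIdx.y mostly sit in different warps.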