switch statement from C compiled to sequential if / else if / else if ...

How can I force nvcc to compile a switch statement into an assembler branch instruction with a precalculated jump address in a register?
Take the following code:

switch(variable) {
case 0: casecode0; break;
case 1: casecode1; break;
case 2: casecode2; break;
case 3: casecode3; break;
}

Suppose size of casecode0 = size of casecode1 = size of casecode2 = size of casecode3.
(Even if the sizes are not equal, the compiler can pad each casecode to the size of the largest one. A more portable but slightly slower technique is to replace every casecode with a single unconditional branch instruction that jumps to the address of its code. That way the case structure becomes a list of branch instructions of fixed, known size, and the jump to each address is easy to calculate.)
The previous code would be expected to compile to something like:

mov r1,Case0Address
mov r2,codesize
mul r2,variable
add r1,r2
bra r1

@Case0Address:
code0
bra ExitSwitch
@Case1Address:
code1
bra ExitSwitch
@Case2Address:
code2
bra ExitSwitch
@Case3Address:
code3
bra ExitSwitch

@ExitSwitch


or, the more portable way (slightly slower, but it doesn't require all case codes to be padded to the size of the largest, so memory is saved):

mov r1,CaseList
mov r2,variable
mul r2,SizeOfBranchInstruction /* determined by hardware architecture */
add r1,r2
bra r1

@CaseList:
bra Case0Address
bra Case1Address
bra Case2Address
bra Case3Address

@Case0Address:
code0
bra ExitSwitch
@Case1Address:
code1
bra ExitSwitch
@Case2Address:
code2
bra ExitSwitch
@Case3Address:
code3
bra ExitSwitch

@ExitSwitch

That (or something similar) is what I expected: each case takes the same time to reach its branch target (if the case bodies are equal in length), and all threads executing the code reach ExitSwitch after the same number of executed instructions.
Unfortunately, nvcc compiles the switch statement as a sequence of if / else if statements, meaning the previous code is compiled as if you had written:

if(variable==0) code0
elseif(variable==1) code1
elseif(variable==2) code2
elseif(variable==3) code3

I was disappointed after looking at the PTX file.
This way is much, much slower, especially if the statement has a large number of cases. For example, suppose each thread has its own qualifier (variable) that determines which of, say, 16 code blocks should be executed. With the if ... else if ... structure, all threads with qualifier==16 must perform 15 wasted comparisons. Only threads with qualifier==1 execute their code after a single comparison.

The main reason the switch statement exists in standard C is obvious from the previous example: it doesn't suffer from this problem. The compiler can use several approaches to compute branch addresses without sequential comparisons (in some specific cases it uses shift and mask instructions to calculate the jump address, but more often it doesn't need to). The only hardware requirement is an assembler jump (or call) instruction that takes a register as its argument, and that exists on NVIDIA's GPUs.

So my question is: why doesn't nvcc use it, when it exists? Second question: has anyone written any C code that, compiled by nvcc, produces a

bra r1

instruction in the PTX file?

There is no way to enforce such behaviour, and the reason is really simple:

It is SIMD.

You can't do a calculated branch in this case, because then all the processors would branch to the same target (which is not what you want).

I’d agree with you that an efficient compilation of a switch statement would be useful. However, in the PTX Instruction Set Architecture manual (version 1.2, distributed with CUDA 2.0 Beta 2), Chapter 7, pg. 63, in the Release Notes for the BRA instruction, it states:

Indirect branch through a register is unimplemented

Jeremy Furtek

To: Flolo

Every thread has its own qualifier (which can differ between threads), and the expected divergent execution wouldn't differ from the existing if ... else if ... structure, where different threads also execute different code under divergence. So divergence exists and is possible on SIMD. The only difference would be faster execution, because every branch takes the same time to reach its code, avoiding unnecessary comparisons.

Thanks, I missed that info at the bottom of the table in the PDF. Why did they put it in the list of instructions when it is not implemented? ... Maybe it will exist in future versions.

thanks again

AFAIK the divergence is handled by serialization, which means nothing other than that every possible code path is executed on all processors, and the processors on a different path just ignore the irrelevant code (which corresponds more or less 1-to-1 to the generated if structure). So they still fetch instructions from the same address, which leaves no room for calculated branches.

The only way a calculated branch could be useful in a SIMD environment would be if the select expression were independent of the processor, e.g. something like a shared variable.

If I understand correctly, you are saying that some ALUs fetch code from memory even when it won't be executed on those ALUs?

For example, this simple code:

if (threadIdx.x == 0) mem[0] = 5;
else if (threadIdx.x == 1) mem[1] = 10;
else if (threadIdx.x == 2) mem[2] = 15;
__syncthreads();

Let's observe the threads with indexes 0, 1 and 2, supposing they are synchronized and have now reached the first line. All threads do the comparison with 0, and only the thread with index 0 satisfies the condition. NOW you are saying that all threads will fetch the mem[0]=5 instruction (actually several assembler instructions) from memory, but only the thread with index 0 will execute it; threads 1 and 2 fetch all of them but ignore them. By your claim, all three threads then fetch the next compare instruction, but the thread with index 0 ignores it and only threads 1 and 2 execute it. After that comparison, all threads fetch mem[1]=10, but only thread 1 executes it, while threads 0 and 2 ignore it. And so on ...

Can you confirm that you are sure it works like that?

EDIT:

I don't think that is how the MP executes divergent code. If it were true, you would never need to synchronize threads, because all threads would always read instructions from the same address and all PCs (program counters) would always be identical. But that is not the case.

In my opinion, the MP has only one instruction-fetch mechanism, so it can fetch only one instruction at a time, and execution of the divergent threads in the previous example is serialized in this way:

All threads execute the comparison from the first line. After it, thread0 branches to the location of the instructions for mem[0]=5, while the other threads branch to the second comparison and must wait until they get access to the fetch mechanism. Thread0 executes its branch instruction (jump) and waits until it gets the fetch mechanism again. Now the second comparison instruction is fetched and executed by threads 1 and 2. Then thread0 fetches its instruction (probably moving the immediate constant 5 into a register); then thread1 executes its jump to mem[1]=10; thread2 executes the third comparison; thread0 executes a register-to-memory store into location 0; thread1 executes a constant-10-to-register instruction; thread2 executes its jump to mem[2]=15; thread0 fetches and executes a jump to the thread synchronization point; thread1 stores its register into memory location 1; thread2 stores the constant 15 into a register; thread0 waits until the other threads reach the sync point; thread1 jumps to the synchronization point; thread2 stores its register into memory location 2; thread1 waits at the sync point with thread0; thread2 jumps to the synchronization point, and after that all threads are synchronized ...

As you can see, thread0 reaches the sync point earlier and must wait for thread1 and thread2. As a result you have:

  • 6 serialized branch instructions (one pair for each branch where first from the pair is to jump into case code and second is to jump out)

  • 3 serialized mov (immediate constant value to register) instructions

  • 3 serialized store (register to memory location) instructions

  • 2 serialized compare instructions (first compare is done in parallel)

Using a precalculated jump you would have:

  • 6 serialized branch instructions (one pair for each branch where first from the pair is to jump into case code and second is to jump out)

  • 3 serialized mov (immediate constant value to register) instructions

  • 3 serialized store (register to memory location) instructions

without serialized cmp instructions!

And all threads reach sync point at the same time!

Not really. The ALU doesn't fetch code at all; there is a separate functional unit for instruction fetch. (And because multiple ALUs share one of these, they have to execute the same operation.)

Not a well-chosen example; a good compiler would/should optimize it to

mem[threadIdx.x] = 5 * (threadIdx.x + 1);

where you have no divergence at all.

To end any speculation, I tried to find a description in the docs.

So at the warp level they share the same instruction pointer and behave as I said. Every processor of course has its own instruction pointer, and diverging warps don't need to serialize as long as there is no divergence within a warp.

That is exactly what I described in my previous post: serialization is done in such a way that threads run their divergent branches serialized but interleaved, by passing ownership of the fetch mechanism between ALUs. Compile the previous sample code, look at the PTX file, and try to analyze the execution flow for each thread separately. You will see that branching with a register as argument will be more efficient (once it is implemented) than the sequential if ... else if ... structure.

I am sorry, but serialized thread branches cannot run simultaneously; you only have one instruction decoder. Also keep in mind that what you see in PTX may differ from what actually runs on the GPU. You can try decuda to see what the GPU is really running.

ALUs don't fetch code. The instruction decoder decodes the instruction for the next warp, then the ALUs run the instruction 4 times, possibly skipping when the thread they are servicing should not run because of a divergent branch.

__syncthreads() is needed when you have more than one warp in a block. With 32 threads or fewer per block, you do not need to use __syncthreads().

I agree with almost all of that, and the terminology definitely causes confusion.

When I said simultaneously, I didn't mean in parallel. Look: if you have two divergent threads, suppose each has 10 different assembler instructions to execute from different addresses. Flolo said that one such thread does nothing until the other completely finishes (in this example, all 10 instructions). But I said that is not true: such threads run interleaved, one thread executing one instruction, then the other thread executing one instruction, and so on. That is what I meant by simultaneous and serialized, but not parallel. Individual instructions are serialized (not whole branches, as Flolo said), and that way of execution would allow implementation of a bra reg instruction.

To answer your original question: if you read the PTX spec (p. 63), you'll see that indirect branch through a register is not implemented - the G80 hardware does not support this. So there is currently no other way to implement switch statements.

Yes, I know; thanks to Jeremy Furtek, who already pointed out that information in the PTX PDF. The later discussion was only about whether, IF NVIDIA implements such an instruction (bra reg) in future hardware, there would be any benefit from it; some think there wouldn't be, due to the SIMD architecture, but I do not agree, as I explained earlier. That's all.

Thanks again.