How can I force nvcc to compile a switch statement into an assembler branch instruction that takes a precalculated jump address in a register?
Consider the following code construct:
switch(variable) {
case 0: casecode0; break;
case 1: casecode1; break;
case 2: casecode2; break;
case 3: casecode3; break;
}
Suppose size of casecode0 = size of casecode1 = size of casecode2 = size of casecode3.
(Even if the sizes are not equal, the compiler can pad each case's code to the size of the largest one. A more portable but slightly slower technique is to replace each case's code with a single unconditional branch instruction that jumps to the address of that code. The case structure then becomes a list of branch instructions of fixed, known size, and the jump address for each case is easy to calculate.)
I would expect the code above to compile to something like:
mov r1,Case0Address
mov r2,codesize
mul r2,variable
add r1,r2
bra r1
…
@Case0Address:
code0
bra ExitSwitch
@Case1Address:
code1
bra ExitSwitch
@Case2Address:
code2
bra ExitSwitch
@Case3Address:
code3
bra ExitSwitch
…
@ExitSwitch:
or, in a more portable way (slightly slower, but it does not require every case's code to be padded to the size of the largest, so memory is saved):
mov r1,CaseList
mov r2,variable
mul r2,SizeOfBranchInstruction /* determined by hardware architecture */
add r1,r2
bra r1
…
@CaseList:
bra Case0Address
bra Case1Address
bra Case2Address
bra Case3Address
…
@Case0Address:
code0
bra ExitSwitch
@Case1Address:
code1
bra ExitSwitch
@Case2Address:
code2
bra ExitSwitch
@Case3Address:
code3
bra ExitSwitch
…
@ExitSwitch:
That, or something similar, is what I expected: each case takes the same time to execute (if the codes are equal in length), and all threads executing the code reach ExitSwitch after an equal number of executed instructions.
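For comparison, the branch-list idea can be written out at the C level as a table of function pointers. This is only a minimal sketch: the names case_code0..case_code3, case_list and dispatch are hypothetical stand-ins for casecode0..casecode3 and CaseList, and nothing here guarantees that nvcc will emit an indirect branch or call for it on a given GPU architecture.

```c
#include <stddef.h>

/* Hypothetical stand-ins for casecode0..casecode3 from the question. */
static int case_code0(int x) { return x + 1; }
static int case_code1(int x) { return x * 2; }
static int case_code2(int x) { return x - 3; }
static int case_code3(int x) { return -x; }

typedef int (*case_fn)(int);

/* Plays the role of CaseList: one fixed-size entry per case. */
static const case_fn case_list[] = {
    case_code0, case_code1, case_code2, case_code3
};

/* Select the handler by index instead of by a chain of comparisons.
   A single range check replaces the per-case equality tests. */
int dispatch(unsigned variable, int x)
{
    const size_t count = sizeof case_list / sizeof case_list[0];
    if (variable >= count) {
        return x; /* no matching case: behave like a switch without default */
    }
    return case_list[variable](x);
}
```

Every in-range value of variable reaches its handler after the same amount of work, which is the property the listings above are after.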
Unfortunately, nvcc compiles the switch construct as a sequence of if / else if statements, meaning the code above is compiled as if you had written:
if(variable==0) code0
else if(variable==1) code1
else if(variable==2) code2
else if(variable==3) code3
I was disappointed after looking at the PTX file.
This approach is much slower, especially if the construct has a large number of cases. For instance, suppose each thread has its own qualifier (variable) that determines which of, say, 16 codes should be executed. With the if … else if … structure, all threads with qualifier==16 must do 15 wasted comparisons. Only threads with qualifier==1 execute their code after a single comparison.
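The cost difference can be made concrete with a small counting model. This is a sketch under stated assumptions: chain_comparisons and table_comparisons are hypothetical helper names, cases are numbered 1..case_count as in the 16-case example, and the model counts equality tests only, not real GPU cycles.

```c
/* Number of equality tests a linear if / else if chain performs before it
   reaches the matching case. Cases are numbered 1..case_count; a selector
   outside that range fails every test. */
unsigned chain_comparisons(unsigned selector, unsigned case_count)
{
    unsigned comparisons = 0;
    for (unsigned c = 1; c <= case_count; ++c) {
        ++comparisons;          /* one test per case visited */
        if (selector == c) {
            break;              /* matching case found, chain ends here */
        }
    }
    return comparisons;
}

/* A table-based dispatch needs one range check, whatever the selector is. */
unsigned table_comparisons(unsigned selector, unsigned case_count)
{
    (void)selector;
    (void)case_count;
    return 1;
}
```

In this model a thread with selector 16 performs 16 tests in the chain, 15 of them wasted, while a thread with selector 1 performs one; the table variant stays at one check for both.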
The main reason the switch construct exists in standard C is obvious from the previous example: it does not suffer from this problem. A compiler can use several approaches to optimize it and calculate branch addresses without sequential comparisons (in some specific cases it uses shift and mask instructions to calculate the jump address, but more often it does not). The only hardware requirement is an assembler jump (or call) instruction that takes a register as its argument, and such an instruction exists on NVIDIA's GPUs.
So my question is: why doesn't nvcc use it when it exists? Second question: has anyone written any C code that, when compiled by nvcc, produces a
bra r1
instruction in the PTX file?