unexpected loop unrolling

Another bug (?) I found when working on my Data Encryption Standard kernel.

The narrowed-down problem:

[codebox]#include <stdint.h>

#define type uint64_t

// Bit-expansion macro, written as "shift first, AND later"
#define E(in0) (((in0<<11)&0xfc0000000)|((in0<<3)&0xfc0)|((in0<<7)&0xfc0000)|((in0<<9)&0x3f000000)|((in0<<47)&0x800000000000)|((in0>>31)&0x1)|((in0<<15)&0x7c0000000000)|((in0<<1)&0x3e)|((in0<<13)&0x3f000000000)|((in0<<5)&0x3f000))

__device__ __constant__ type const_data[128];

__global__ void PERM(type* Result){
	// The macro is applied directly to the constant-memory read
	for(int i=0;i<8;i++){
		Result[i]=E(const_data[i]);
	}
}[/codebox]

This compiles to: Used 12 registers, 8+16 bytes smem, 1024 bytes cmem[0], 8 bytes cmem[1]

and decuda shows that the loop is not unrolled.

But when changed to:

[codebox]__global__ void PERM(type* Result){
	type tmp0,tmp1;
	// Same computation, but routed through two temporaries
	for(int i=0;i<8;i++){
		tmp0=const_data[i];tmp1=E(tmp0);Result[i]=tmp1;
	}
}[/codebox]

it compiles to: Used 60 registers, 8+16 bytes smem, 1024 bytes cmem[0], 4 bytes cmem[1]

and decuda shows that the loop is unrolled (a lot of code, and not a single branch instruction). The second snippet compiles to exactly the same decuda code as the first one with “#pragma unroll 8” before the loop.

Also, “#pragma unroll 0” doesn’t stop the unrolling, which makes the bug nasty.
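For what it's worth, the CUDA programming guide documents `#pragma unroll 1` as the way to ask the compiler not to unroll a loop, and as far as I know current toolkits simply ignore a non-positive count. Below is a minimal sketch of the second kernel with that pragma. I have not tried it on the toolchain from this thread, so whether it avoids the register blow-up there is an open question. The `HD` macro and the `__CUDACC__` guard are my own additions so that the `E` function can also be checked on the host with a plain C++ compiler. `E` is the macro from the post rewritten as a function.

```cpp
#include <stdint.h>

// Assumption: HD is a helper macro of mine, not part of the original post.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// The E macro from the post as a function (shift first, AND later).
HD static inline uint64_t E(uint64_t in0) {
    return ((in0 << 11) & 0xfc0000000ULL)    | ((in0 << 3)  & 0xfc0ULL)
         | ((in0 << 7)  & 0xfc0000ULL)       | ((in0 << 9)  & 0x3f000000ULL)
         | ((in0 << 47) & 0x800000000000ULL) | ((in0 >> 31) & 0x1ULL)
         | ((in0 << 15) & 0x7c0000000000ULL) | ((in0 << 1)  & 0x3eULL)
         | ((in0 << 13) & 0x3f000000000ULL)  | ((in0 << 5)  & 0x3f000ULL);
}

#ifdef __CUDACC__
__device__ __constant__ uint64_t const_data[128];

__global__ void PERM(uint64_t* Result) {
    uint64_t tmp0, tmp1;
    // Documented request to keep the loop rolled (unroll factor 1).
    #pragma unroll 1
    for (int i = 0; i < 8; i++) {
        tmp0 = const_data[i];
        tmp1 = E(tmp0);
        Result[i] = tmp1;
    }
}
#endif
```

Compiling with `nvcc --ptxas-options=-v` prints the register count, so it is easy to compare against the 12 and 60 registers reported above.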

Do you have any idea why loop unrolling increases your register usage?

Yes and no.

I guess the compiler tries to avoid the 24-cycle register read-after-write latency, and in combination with the [topic=“166681”]64bit datatype[/topic] it gets this very, very wrong.

Note that, as described in the linked post, when I change the order of computation from “SHIFT first, AND later” to “AND first, SHIFT later” it uses only 12 registers even after unrolling (and a lot fewer instructions too). My bet would be a bug in how the compiler handles this datatype.
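To make the reordering concrete, here is a sketch of both forms side by side. The AND-first version is my own reconstruction, not the code from the linked post: each mask is the original mask moved back by its shift amount. The function names, the `HD` macro and the host-side check are hypothetical helpers. One thing that stands out is that every AND-first mask fits in the low 32 bits, which may be part of why that form is cheaper, but that is a guess.

```cpp
#include <stdint.h>
#include <cassert>

// Assumption: HD is a helper macro of mine so the file also builds as plain C++.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Form from the post: shift the 64-bit value first, then mask.
HD static inline uint64_t E_shift_first(uint64_t in0) {
    return ((in0 << 11) & 0xfc0000000ULL)    | ((in0 << 3)  & 0xfc0ULL)
         | ((in0 << 7)  & 0xfc0000ULL)       | ((in0 << 9)  & 0x3f000000ULL)
         | ((in0 << 47) & 0x800000000000ULL) | ((in0 >> 31) & 0x1ULL)
         | ((in0 << 15) & 0x7c0000000000ULL) | ((in0 << 1)  & 0x3eULL)
         | ((in0 << 13) & 0x3f000000000ULL)  | ((in0 << 5)  & 0x3f000ULL);
}

// Reordered form: mask first, then shift. Same terms in the same order.
HD static inline uint64_t E_and_first(uint64_t in0) {
    return ((in0 & 0x1f80000ULL)  << 11) | ((in0 & 0x1f8ULL)      << 3)
         | ((in0 & 0x1f800ULL)    << 7)  | ((in0 & 0x1f8000ULL)   << 9)
         | ((in0 & 0x1ULL)        << 47) | ((in0 & 0x80000000ULL) >> 31)
         | ((in0 & 0xf8000000ULL) << 15) | ((in0 & 0x1fULL)       << 1)
         | ((in0 & 0x1f800000ULL) << 13) | ((in0 & 0x1f80ULL)     << 5);
}

// Host-side spot check that the two forms agree on a few sample inputs.
static inline void check_forms_agree() {
    const uint64_t samples[] = {0ULL, 1ULL, 0x80000000ULL,
                                0xffffffffULL, 0x0123456789abcdefULL};
    for (int i = 0; i < 5; i++) {
        assert(E_shift_first(samples[i]) == E_and_first(samples[i]));
    }
}
```

Calling `check_forms_agree()` from host code before swapping the macro in the kernel is a cheap way to make sure the rewrite did not change the result.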

At some point it completely broke my kernel, but I got over it by using 32bit types explicitly :)

Ah, one more thing: maxrregcount does not help in this case, it just starts spilling to lmem :/