Another bug (?) I found while working on my Data Encryption Standard (DES) kernel.
Narrowed-down problem:
[codebox]#define type uint64_t
#define E(in0) (((in0<<11)&0xfc0000000)|((in0<<3)&0xfc0)|((in0<<7)&0xfc0000)|((in0<<9)&0x3f000000)|((in0<<47)&0x800000000000)|((in0>>31)&0x1)|((in0<<15)&0x7c0000000000)|((in0<<1)&0x3e)|((in0<<13)&0x3f000000000)|((in0<<5)&0x3f000))
__device__ __constant__ type const_data[128];
__global__ void PERM(type* Result){
for(int i=0;i<8;i++){
Result[i]=E(const_data[i]);
}
}[/codebox]
This compiles to: Used 12 registers, 8+16 bytes smem, 1024 bytes cmem[0], 8 bytes cmem[1]
and decuda shows that the loop is not unrolled.
But when the kernel is changed to:
[codebox]__global__ void PERM(type* Result){
type tmp0,tmp1;
for(int i=0;i<8;i++){
tmp0=const_data[i];tmp1=E(tmp0);Result[i]=tmp1;
}
}[/codebox]
it compiles to: Used 60 registers, 8+16 bytes smem, 1024 bytes cmem[0], 4 bytes cmem[1]
and decuda shows that the loop is fully unrolled (a lot of code, and not a single branch instruction). The second snippet compiles to exactly the same decuda code as the first one with “#pragma unroll 8” before the loop.
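To rule out any semantic difference between the two kernels, the two loop bodies can be compared on the host. This is only a rough sketch: `u64` stands in for the `type` macro, and `perm_direct`, `perm_with_temps`, `variants_agree` and the fill pattern for `const_data` are hypothetical names and values made up for the check, not part of the real kernel. If the two agree, the register difference can only come from the compiler's unrolling decision.

```cpp
#include <cstdint>

typedef uint64_t u64;  // stands in for the post's `type` macro

// Same expansion macro as in the post; only the argument is parenthesized.
#define E(in0) ((((in0)<<11)&0xfc0000000)|(((in0)<<3)&0xfc0)|(((in0)<<7)&0xfc0000)|(((in0)<<9)&0x3f000000)|(((in0)<<47)&0x800000000000)|(((in0)>>31)&0x1)|(((in0)<<15)&0x7c0000000000)|(((in0)<<1)&0x3e)|(((in0)<<13)&0x3f000000000)|(((in0)<<5)&0x3f000))

// Host-side copy of the constant table (assumption: plain array, no GPU involved).
static u64 const_data[128];

// Loop body of the first kernel: macro applied directly to the table entry.
void perm_direct(u64* Result) {
    for (int i = 0; i < 8; i++) {
        Result[i] = E(const_data[i]);
    }
}

// Loop body of the second kernel: same work routed through two temporaries.
void perm_with_temps(u64* Result) {
    u64 tmp0, tmp1;
    for (int i = 0; i < 8; i++) {
        tmp0 = const_data[i]; tmp1 = E(tmp0); Result[i] = tmp1;
    }
}

// Fills the table with an arbitrary pattern (hypothetical test values, not DES data)
// and reports whether both variants produce identical results.
bool variants_agree() {
    for (int i = 0; i < 128; i++) {
        const_data[i] = 0x0123456789abcdefULL * (u64)(i + 1);
    }
    u64 a[8], b[8];
    perm_direct(a);
    perm_with_temps(b);
    for (int i = 0; i < 8; i++) {
        if (a[i] != b[i]) return false;
    }
    return true;
}
```

Calling `variants_agree()` from any host `main` is enough; it exercises only the first 8 table entries, the same ones the kernels read.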
Also, “#pragma unroll 0” does not prevent the unrolling, which makes the bug nasty.
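For reference, the CUDA programming guide documents `#pragma unroll 1` (rather than 0) as the directive that tells the compiler not to unroll a loop. A minimal sketch of the kernel with that directive is below; whether it actually avoids the 60-register version on this compiler is untested. The `u64` typedef (in place of the `type` macro) and the `#ifndef __CUDACC__` fallback, which only exists so the sketch also builds with a plain host compiler, are assumptions added for illustration.

```cpp
#include <cstdint>

// Assumption: when not compiled by nvcc, make the CUDA qualifiers no-ops
// so the same source can be built and checked on the host.
#ifndef __CUDACC__
#define __global__
#define __device__
#define __constant__
#endif

typedef uint64_t u64;  // stands in for the post's `type` macro

// Same expansion macro as in the post; only the argument is parenthesized.
#define E(in0) ((((in0)<<11)&0xfc0000000)|(((in0)<<3)&0xfc0)|(((in0)<<7)&0xfc0000)|(((in0)<<9)&0x3f000000)|(((in0)<<47)&0x800000000000)|(((in0)>>31)&0x1)|(((in0)<<15)&0x7c0000000000)|(((in0)<<1)&0x3e)|(((in0)<<13)&0x3f000000000)|(((in0)<<5)&0x3f000))

__device__ __constant__ u64 const_data[128];

__global__ void PERM(u64* Result) {
    // Documented request to keep the loop rolled (unroll factor 1).
    #pragma unroll 1
    for (int i = 0; i < 8; i++) {
        u64 tmp0 = const_data[i];  // temporaries kept, as in the second snippet
        u64 tmp1 = E(tmp0);
        Result[i] = tmp1;
    }
}
```

Compiling this with the same verbose ptxas output as above and checking the register count would show whether the directive is honored.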