Compiler problems with 64-bit datatype and logical instructions

Hello.

I have a fairly large code base implementing the Data Encryption Standard, but under toolkit 2.3 it compiles to a completely unacceptable amount of resources (60+ registers plus spilling; --maxrregcount only increases the spilling).

I narrowed the problem down to two code snippets that are meant to do the same thing: they implement a tiny part of the E permutation. You can either SHIFT first and AND later, or AND first and SHIFT later, but the two variants compile completely differently. You'll see the problem in the decuda disassembly.

[codebox]
#include <stdint.h>

#define type uint64_t

// AND first, SHIFT later
#define X(data) (((data) & 0x1f80000) << 11)
// SHIFT first, AND later
#define Y(data) (((data) << 11) & 0xc0000000)

__device__ __constant__ type const_data[128];

__global__ void PERM(type* Result){
    Result[0] = Y(const_data[0]);
}
[/codebox]

This kernel compiles and disassembles into:

[codebox]
.entry _Z4PERMPm
{
.lmem 0
.smem 24
.reg 4
.bar 0
mov.b32 $r0, c0[0x0000]
shl.u32 $r1, $r0, 0x0000000b
mov.b32 $r0, s[0x0010]
mov.b32 $r3, $r124
and.b32 $r2, $r1, c1[0x0000]
mov.end.b64 g[$r0], $r2
#.constseg 1:0x0000 const
#{
#d.u32 0xc0000000 // 0000
#}
}
[/codebox]

But when I use the other macro instead,

[codebox]
Result[0] = X(const_data[0]);
[/codebox]

it disassembles into:

[codebox]
.entry _Z4PERMPm
{
.lmem 0
.smem 24
.reg 4
.bar 0
mov.b32 $r0, 0x01f80000
mov.b32 $r1, 0x00000020
and.b32 $r2, $r0, c0[0x0000]
add.u32 $p0|$o127, $r1, c1[0x0000]
shr.u32 $r1, $r2, 0x00000015
@$p0.sf mov.b32 $r1, $r124
mov.b32 $r0, s[0x0010]
shl.u32 $r2, $r2, 0x0000000b
mov.b32 $r3, $r1
mov.end.b64 g[$r0], $r2
#.constseg 1:0x0000 const
#{
#d.u32 0xfffffff5 // 0000
#}
}
[/codebox]
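One thing worth double-checking is whether the two macros really select the same bits on a 64-bit input. The host-side sketch below is only an illustration: `perm_x`, `perm_y` and `high_word` are hypothetical helper names that are not part of the kernel. With these masks, `X` keeps six bits that end up in positions 30..35 and so cross into the upper 32-bit word, while `Y` keeps only bits 30..31, so its upper word is always zero. That difference may account for part of the extra work in the second disassembly, though this is an assumption and not something confirmed from the compiler side.

```c
#include <stdint.h>

/* Same macros as in the kernel, evaluated on the host. */
#define X(data) (((data) & 0x1f80000) << 11)   /* AND first, SHIFT later */
#define Y(data) (((data) << 11) & 0xc0000000)  /* SHIFT first, AND later */

/* Hypothetical helpers for illustration; not part of the original code. */
uint64_t perm_x(uint64_t data) { return X(data); }
uint64_t perm_y(uint64_t data) { return Y(data); }

/* Upper 32 bits of a 64-bit value. */
uint32_t high_word(uint64_t v) { return (uint32_t)(v >> 32); }
```

For example, calling `perm_x` and `perm_y` on the same input and comparing `high_word` of the two results shows whether the upper word can ever be nonzero for each variant.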

My questions are:

Why does the second version have twice the number of instructions? (It also uses MUCH more registers in the full implementation.)

In the second version the additional instructions look useless to me, especially the “add.u32 $p0|$o127, $r1, c1[0x0000]”. Am I missing something?

In the first snippet, “mov.b32 $r3, $r124”: what is $r124?

When I changed the code to explicitly use a 32-bit datatype, it compiled to 16 registers (which I later optimized to 10, yay, full occupancy :) )

But why does the 64-bit code behave so badly? Is it a compiler bug?
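For reference, the explicit 32-bit approach mentioned above could look roughly like the sketch below. This is not the actual rewritten code: the `u32pair` struct and the `perm_x_split` function are hypothetical names, and it assumes the 64-bit word is simply held as a low and a high 32-bit half. It covers only the AND-first macro `X`.

```c
#include <stdint.h>

/* Hypothetical layout: one 64-bit word held as two explicit 32-bit halves. */
typedef struct {
    uint32_t lo;  /* bits 0..31  */
    uint32_t hi;  /* bits 32..63 */
} u32pair;

/* (data & 0x1f80000) << 11 written on 32-bit halves.
   The mask only touches the low half, so in.hi is not read. */
u32pair perm_x_split(u32pair in) {
    uint32_t masked = in.lo & 0x01f80000u;
    u32pair out;
    out.lo = masked << 11;  /* bits that stay in the low word */
    out.hi = masked >> 21;  /* bits that cross into the high word (32 - 11 = 21) */
    return out;
}
```

Written this way, every operation is a plain 32-bit AND or shift with a constant shift count, so nothing is left for the compiler to decide about how to split a 64-bit shift.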