multiword bit shifting (long intgers)

Hello,

my GPU code uses “long integers” (3 / 6 32bit unsigned integers for 96 / 192 bit numbers).

Regulary I need to shift bits on this long ints.

Example code:

// 192bit (6x 32bit) integer: D=d0 + d1*(2^32) + d2*(2^64) + ...

typedef struct

{

  unsigned int d0,d1,d2,d3,d4,d5;

}int192;

...

int192 nn;

int shl, shr;

shl = 15;

shr = 32 - shl;

...

// shiftleft nn <shl> bits

nn.d5 = (nn.d5 << shl) + (nn.d4 >> shr);

nn.d4 = (nn.d4 << shl) + (nn.d3 >> shr);

nn.d3 = (nn.d3 << shl) + (nn.d2 >> shr);

nn.d2 = (nn.d2 << shl) + (nn.d1 >> shr);

nn.d1 = (nn.d1 << shl) + (nn.d0 >> shr);

nn.d0 =  nn.d0 << shl;

Any ideas / hints for a faster variant?

Oliver

Hello,

my GPU code uses “long integers” (3 / 6 32bit unsigned integers for 96 / 192 bit numbers).

Regulary I need to shift bits on this long ints.

Example code:

// 192bit (6x 32bit) integer: D=d0 + d1*(2^32) + d2*(2^64) + ...

typedef struct

{

  unsigned int d0,d1,d2,d3,d4,d5;

}int192;

...

int192 nn;

int shl, shr;

shl = 15;

shr = 32 - shl;

...

// shiftleft nn <shl> bits

nn.d5 = (nn.d5 << shl) + (nn.d4 >> shr);

nn.d4 = (nn.d4 << shl) + (nn.d3 >> shr);

nn.d3 = (nn.d3 << shl) + (nn.d2 >> shr);

nn.d2 = (nn.d2 << shl) + (nn.d1 >> shr);

nn.d1 = (nn.d1 << shl) + (nn.d0 >> shr);

nn.d0 =  nn.d0 << shl;

Any ideas / hints for a faster variant?

Oliver

No, each op is pretty basic so you’re not going to save any instructions anywhere. CUDA doesn’t have a circular shift intrinsic so you pretty much need all those left-right shift pairs.

Your algorithm may not be compatible with this… but you could represent your 192 bit integer with 13 words and not 6, storing 15 bits per word. Then a shift left of 15 bits is just 12 register to register copies, and finally setting d0 to 0. (If you’re doing big-int math, this may have useful side effects since any 3-argument multiply-add won’t overflow… even a big accumulation of ab+cd+ef+gh won’t overflow.)
It may even be possible to use 16 bit words in the structure (each holding 15 bits of data)… I think, but am not sure, that half-word register to register copies can be done in one instruction.

No, each op is pretty basic so you’re not going to save any instructions anywhere. CUDA doesn’t have a circular shift intrinsic so you pretty much need all those left-right shift pairs.

Your algorithm may not be compatible with this… but you could represent your 192 bit integer with 13 words and not 6, storing 15 bits per word. Then a shift left of 15 bits is just 12 register to register copies, and finally setting d0 to 0. (If you’re doing big-int math, this may have useful side effects since any 3-argument multiply-add won’t overflow… even a big accumulation of ab+cd+ef+gh won’t overflow.)
It may even be possible to use 16 bit words in the structure (each holding 15 bits of data)… I think, but am not sure, that half-word register to register copies can be done in one instruction.

Hello!

I should have mentioned it: the amount of how many bits are shifted are not constant.
The amounts are well known at specifc parts of the code but there are many places with different shifts.
Single bit shl is done with add.cc.u32, addc.cc.u32, … addc.u32. (I’ve build my own intrinsics for this, add with carry is a very nice feature for my code.) :)

According to CUDA_C_Programming_Guide.pdf a 32bit integer shift are kindof slow on Fermi (half the speed of a pre-Fermi per shader core…). :(

Oliver

Hello!

I should have mentioned it: the amount of how many bits are shifted are not constant.
The amounts are well known at specifc parts of the code but there are many places with different shifts.
Single bit shl is done with add.cc.u32, addc.cc.u32, … addc.u32. (I’ve build my own intrinsics for this, add with carry is a very nice feature for my code.) :)

According to CUDA_C_Programming_Guide.pdf a 32bit integer shift are kindof slow on Fermi (half the speed of a pre-Fermi per shader core…). :(

Oliver