Bit manipulation instructions in Fermi: looking for a bitalign-like thing

There was a small discussion here earlier about crypto performance and integer calculations in general (http://forums.nvidia.com/index.php?showtopic=154752).

In short,

And actually, for ATI 5XXX such an instruction exists:

Now, with the Fermi release, I tried to find NVIDIA’s bitalign and failed. Looking at PTX ISA v2.0, there are two new bit manipulation instructions: bfe (bit field extract) and bfi (bit field insert). While bfe looks like the closest match for the bitalign I’m looking for, there are two issues:

  1. bfe and bfi are not exposed at the C level at all. I mean there are no signs of them in the header files (whereas brev, also added with PTX 2.0, is there, for example).

  2. bfe can only take a single 32-bit or 64-bit operand as its input value; it isn’t possible to feed it 2x32-bit values. So in fact bfe(d, a, start, len) is just a replacement for d = (a >> start) & ((1 << len) - 1) for unsigned values.
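To make the shift-and-mask equivalence concrete, here is a minimal host-side C sketch of the unsigned 32-bit case. The helper name bfe_u32 is made up for illustration, it is not an NVIDIA intrinsic, and the handling of out-of-range start/len values is my assumption about reasonable behaviour, not a statement of what the hardware does.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side model of an unsigned 32-bit bit field extract:
   returns `len` bits of `a`, starting at bit `start`, right-aligned.
   The edge cases (len == 0, start >= 32, len >= 32) are handled
   explicitly because plain C shifts by 32 or more are undefined. */
static inline uint32_t bfe_u32(uint32_t a, unsigned start, unsigned len)
{
    if (len == 0 || start >= 32)
        return 0;                        /* empty field, or field past the top bit */
    uint32_t shifted = a >> start;       /* start < 32, so this shift is defined */
    if (len >= 32)
        return shifted;                  /* field covers every remaining bit */
    return shifted & ((UINT32_C(1) << len) - 1);
}

/* Small usage example doubling as a sanity check. */
static inline void bfe_u32_selfcheck(void)
{
    assert(bfe_u32(0xABCD1234u, 8, 8) == 0x12u);   /* second byte */
    assert(bfe_u32(0xABCD1234u, 28, 8) == 0xAu);   /* field runs off the top */
}
```

Note that the naive one-liner breaks for len == 32 in C, which is why the sketch special-cases it.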

So the question is: is it possible at all to use bfe for a 32-bit cyclic rotation? Obviously it would have to be bfe with a 64-bit operand, but how are 64-bit operands handled by the hardware? Are there real 64-bit registers, or are they emulated with 2x32-bit ones? If they are real 64-bit registers, then it’s simply pointless to convert 32 bits to 64, perform bfe and convert back; it is easier to stay with the old “shift left, shift right, logical or” sequence. If they are emulated, then how do I access the low and high parts of a 64-bit register from PTX? And is there any point in doing so, i.e. will it be compiled into a single instruction?
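For reference, the two approaches being compared can be sketched in plain host-side C. rotl32 is the classic three-operation sequence; align32 is a hypothetical model of a bitalign-like operation as I understand it (concatenate two 32-bit words into a 64-bit value, shift right, keep the low 32 bits). Both names are mine, and whether a GPU compiler turns either one into a single instruction is exactly the open question.

```c
#include <assert.h>
#include <stdint.h>

/* Classic rotation: shift left, shift right, logical OR.
   The shift counts are masked so that n == 0 stays well defined. */
static inline uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

/* Assumed bitalign-style model: treat hi:lo as one 64-bit value,
   shift it right and keep the low 32 bits. */
static inline uint32_t align32(uint32_t hi, uint32_t lo, unsigned shift)
{
    uint64_t wide = ((uint64_t)hi << 32) | lo;
    return (uint32_t)(wide >> (shift & 31));
}

/* Feeding the same word as both halves gives a rotate right. */
static inline uint32_t rotr32_via_align(uint32_t x, unsigned n)
{
    return align32(x, x, n);
}

/* Usage example: rotating left by n equals rotating right by 32 - n. */
static inline void rotate_selfcheck(void)
{
    assert(rotl32(0x12345678u, 8) == 0x34567812u);
    assert(rotr32_via_align(0x12345678u, 24) == rotl32(0x12345678u, 8));
}
```

The 64-bit version only pays off if the widening and narrowing are free, which is what the question about real versus emulated 64-bit registers is about.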

Any ideas?

Hi Ivan,
I suggest you take a look at the disassembler from the Nouveau project, nv50dis (nvc0dis actually): http://0x04.net/cgit/index.cgi/nv50dis/tree/nvc0dis.c.
So it seems bfe and bfi are supported (as ext and ins), though it is not currently known whether their 64-bit forms are.

It would be interesting to experiment with how the 64-bit bfe/bfi PTX instructions translate to Fermi assembly… (objdump -s on ELF cubins followed by some awk magic should produce something that can be fed to nvc0dis.)

Accessing low and high parts of a 64-bit value should not be a problem, at least at assembly level. At C or PTX level, you will need to experiment with shifts or casts to uint2. Hopefully the compiler will detect the idiom and generate the expected code…
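As a starting point for such experiments, the shift-based idiom looks like this in plain C. The helper names are invented for illustration, and this says nothing about what code the CUDA compiler actually generates for it.

```c
#include <assert.h>
#include <stdint.h>

/* Low 32 bits of a 64-bit value: the narrowing cast truncates. */
static inline uint32_t lo32(uint64_t v)
{
    return (uint32_t)v;
}

/* High 32 bits: shift down first, then truncate. */
static inline uint32_t hi32(uint64_t v)
{
    return (uint32_t)(v >> 32);
}

/* Rebuild the 64-bit value from its two halves. */
static inline uint64_t join64(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Usage example: splitting and rejoining round-trips the value. */
static inline void halves_selfcheck(void)
{
    uint64_t v = UINT64_C(0x0123456789ABCDEF);
    assert(hi32(v) == 0x01234567u);
    assert(lo32(v) == 0x89ABCDEFu);
    assert(join64(hi32(v), lo32(v)) == v);
}
```

If 64-bit values really live in register pairs, one would hope these helpers compile down to plain register moves or to nothing at all, but that needs to be checked in the disassembly.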