Bytes manipulation in PTX

rs277 · September 13, 2021, 9:43am

Yes, I was thinking of pages 58-59, but on a second look they are “Control and Other” instructions, so not hugely useful.

Yes, the Programming Guide shows the dreaded “Multiple Instructions”.

njuffa · September 13, 2021, 9:49am

Multiple fast instructions combined can still be faster than instructions using a dedicated functional unit with lower throughput. I should take a close look some time, but I think BFI and BFE would boil down to just two to three modern instructions (at least the common flavors). I think these were quarter-throughput on older GPUs?

rs277 · September 13, 2021, 9:54am

Half throughput on compute capability 6.x.

rs277 · March 2, 2023, 2:16am

Another option for byte extraction (which I offer for the sake of future searches on the topic) is the dp4a instruction (SM >= 6.1), exposed through the __dp4a(a, b, c) intrinsic, with a mask selecting the required byte (here byte 1, i.e. bits 8-15), e.g.:

int x = __dp4a(byte_packed_int, 0x00000100, 0);

The benefit is that dp4a is a full-throughput instruction, unlike the alternatives (prmt, bfe, mask and shift), so it is probably only of interest in “inner loop” type situations.
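For anyone wanting to try the extraction, here is a minimal sketch, assuming compilation for sm_61 or later. The function names and the host helper run_extract are hypothetical, made up for illustration. It also shows a detail worth knowing: the int overload of __dp4a sign-extends each byte, while the unsigned overload zero-extends.

```cuda
#include <cuda_runtime.h>

// Zero-extended byte 1 (bits 8..15): the unsigned overload treats every
// byte of both operands as unsigned.
__device__ __forceinline__ unsigned int byte1_unsigned(unsigned int packed)
{
    return __dp4a(packed, 0x00000100u, 0u);
}

// Sign-extended byte 1: the int overload treats every byte as signed.
__device__ __forceinline__ int byte1_signed(int packed)
{
    return __dp4a(packed, 0x00000100, 0);
}

__global__ void extract_kernel(unsigned int packed, int* out)
{
    out[0] = static_cast<int>(byte1_unsigned(packed));
    out[1] = byte1_signed(static_cast<int>(packed));
}

// Hypothetical host helper: runs one thread and returns the chosen result.
// Error checking is omitted to keep the sketch short.
int run_extract(unsigned int packed, bool signed_variant)
{
    int* d_out = nullptr;
    int h_out[2] = {0, 0};
    cudaMalloc(&d_out, sizeof(h_out));
    extract_kernel<<<1, 1>>>(packed, d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return signed_variant ? h_out[1] : h_out[0];
}
```

With packed = 0x4433F211, byte 1 is 0xF2, so the unsigned variant gives 242 and the signed variant gives -14.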

It becomes somewhat more useful (if somewhat niche) when the extracted byte is used as the index into a small shared memory array. The multiply/accumulate part of dp4a can then be used, via a small PTX function, to also perform the byte-based address calculation for the LD.SHARED.XX instruction, eliminating the ISCADD or LEA instruction that normally accompanies it.

This is done by changing the mask value to the size in bytes of an array element and replacing the “0” parameter with the array base address. It only works with shared memory, because the address has to be a 32-bit pointer.
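A rough sketch of how the mask and base address substitution might look, assuming sm_61 or later, a table of 4-byte elements indexed by byte 1, and CUDA 11 or later for __cvta_generic_to_shared. This is not the original poster's code: the names lds_indexed_by_byte1, lookup_kernel and run_lookup are hypothetical, and whether the generated SASS really drops the separate addressing instruction should be checked with cuobjdump for your own kernel.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Returns table[byte 1 of packed] for a shared memory table of 32-bit
// elements. The dp4a mask 0x00000400 multiplies byte 1 by the element size
// (4 bytes) and the accumulator supplies the table base address.
__device__ __forceinline__ uint32_t lds_indexed_by_byte1(const uint32_t* table,
                                                         uint32_t packed)
{
    // 32-bit address of the table within the shared state space.
    uint32_t base = static_cast<uint32_t>(__cvta_generic_to_shared(table));
    uint32_t value;
    asm volatile(
        "{\n\t"
        ".reg .u32 addr;\n\t"
        "dp4a.u32.u32 addr, %1, %2, %3;\n\t"  // addr = base + 4 * byte1
        "ld.shared.u32 %0, [addr];\n\t"
        "}"
        : "=r"(value)
        : "r"(packed), "r"(0x00000400u), "r"(base)
        : "memory");
    return value;
}

__global__ void lookup_kernel(uint32_t packed, uint32_t* out)
{
    __shared__ uint32_t table[256];
    // Fill the table with a recognisable pattern: table[i] = 3 * i + 1.
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        table[i] = 3u * i + 1u;
    __syncthreads();
    if (threadIdx.x == 0)
        *out = lds_indexed_by_byte1(table, packed);
}

// Hypothetical host helper; error checking omitted for brevity.
uint32_t run_lookup(uint32_t packed)
{
    uint32_t* d_out = nullptr;
    uint32_t h_out = 0;
    cudaMalloc(&d_out, sizeof(h_out));
    lookup_kernel<<<1, 64>>>(packed, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

The unsigned form of dp4a is used so that index bytes of 128 and above are not sign-extended into negative offsets. The other three bytes of the mask are zero, so only byte 1 contributes to the address.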

In my particular use case, removing the addressing instructions from the inner loop this way results in a 14% gain.