Byte manipulation in PTX

Yes, I was thinking of pages 58-59, but I just looked again and see that they cover the “Control and Other” instructions, so not hugely useful.

Yes, the Programming Guide shows the dreaded “Multiple Instructions”.

Several fast instructions combined can still be faster than a single instruction that uses a dedicated functional unit with lower throughput. I should take a close look some time, but I think BFI and BFE would boil down to just two or three simple instructions on modern GPUs (at least the common flavors). I think these were quarter-throughput on older GPUs?
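To illustrate what I mean by "two or three instructions", here is a rough sketch in plain C++ of the shift-and-mask equivalents. This is not actual compiler output, the function names are made up for illustration, and it only covers the common unsigned case where the field lies entirely inside the 32-bit word (pos < 32, pos + len <= 32). With compile-time constant pos and len the masks fold to constants, so I would expect each to reduce to a handful of shift and logic operations, but I have not verified that in SASS.

```cpp
#include <cstdint>

// Hypothetical helper: all-ones mask of 'len' low bits.
// Assumes 0 <= len <= 32; the len >= 32 case avoids an undefined 32-bit shift.
inline uint32_t low_mask(uint32_t len) {
    return (len >= 32u) ? 0xFFFFFFFFu : ((1u << len) - 1u);
}

// Rough equivalent of an unsigned bit field extract (BFE):
// returns 'len' bits of 'a' starting at bit 'pos', right-aligned.
// Assumes pos < 32 and pos + len <= 32.
inline uint32_t bfe_u32(uint32_t a, uint32_t pos, uint32_t len) {
    return (a >> pos) & low_mask(len);   // one shift, one AND
}

// Rough equivalent of a bit field insert (BFI):
// returns 'b' with the 'len' bits starting at 'pos' replaced by
// the low 'len' bits of 'a'. Same assumptions as above.
inline uint32_t bfi_b32(uint32_t a, uint32_t b, uint32_t pos, uint32_t len) {
    uint32_t mask  = low_mask(len);
    uint32_t field = mask << pos;                  // bits to overwrite in b
    return (b & ~field) | ((a & mask) << pos);     // clear the field, then merge
}
```

For example, bfe_u32(0xAABBCCDD, 8, 8) picks out byte 1 (0xCC), and bfi_b32(0xEE, 0xAABBCCDD, 16, 8) replaces byte 2, giving 0xAAEECCDD. The same functions should work as device code if marked `__device__`.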

Half throughput on compute capability 6.x.