I am frankly surprised that you observed a largish performance difference from switching ORs to ADDs. I would have expected a small difference only.
You can dump the SASS (machine code) with cuobjdump --dump-sass and compare the variants. On modern processors, logical instructions and integer adds are usually handled by the same ALUs, with the same latency and throughput, and are therefore interchangeable from a performance perspective. I would assume that logical operations require less energy, though.
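To see why the two variants can be interchanged at all: OR and ADD coincide whenever the operand bit fields do not overlap, so no carries can propagate. A minimal byte-packing sketch (hypothetical helper names, plain C standing in for the CUDA source):

```c
#include <stdint.h>

/* Hypothetical byte-packing helpers. Each shifted byte occupies a
   distinct bit field, so no carries can propagate, and the OR-based
   and ADD-based variants produce identical results. */
static uint32_t pack_or (uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
    return (uint32_t)b0 | ((uint32_t)b1 << 8) |
           ((uint32_t)b2 << 16) | ((uint32_t)b3 << 24);
}

static uint32_t pack_add (uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
    return (uint32_t)b0 + ((uint32_t)b1 << 8) +
           ((uint32_t)b2 << 16) + ((uint32_t)b3 << 24);
}
```

For example, pack_or (0x12, 0x34, 0x56, 0x78) and pack_add (0x12, 0x34, 0x56, 0x78) both yield 0x78563412.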
Older GPU architectures had an ISCADD instruction (integer scale and add), which performs a left shift followed by an add. It looks like the latest GPU architectures (Ampere and Turing) may have eliminated this instruction. But newer GPU architectures support an LEA instruction (intended to speed up 64-bit addressing computations) which can be used in much the same way, and some GPU architectures have a fast 32-bit IMAD (integer multiply-add) that can also be used, though it is energetically more expensive. Newer GPU architectures make maximum use of three-input operations, in particular IADD3 (three-input add, presumably implemented as a carry-save adder bolted onto a carry-propagate adder) and LOP3 (which implements any logical operation of three operands; conceptually a lookup table, though not necessarily implemented that way). On the latest architectures these probably have the same throughput as a plain LOP. The throughput of the PRMT unit likely varies by GPU architecture and is likely lower than that of simple ALU operations, because byte permutation is not frequently needed.
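As a functional sketch (my own helper names, plain C standing in for the SASS semantics, not an implementation claim), the scale-and-add and multiply-add instructions compute:

```c
#include <stdint.h>

/* Functional model of an ISCADD / LEA-style operation: left-shift one
   operand, then add the other. On the GPU this is a single
   instruction; here it is just the equivalent C expression. */
static uint64_t scale_add (uint64_t a, unsigned shift, uint64_t b)
{
    return (a << shift) + b;
}

/* Functional model of IMAD: a * b + c in one instruction. */
static uint32_t mul_add (uint32_t a, uint32_t b, uint32_t c)
{
    return a * b + c;
}
```

A typical use is addressing arithmetic, e.g. scale_add (i, 3, base) for base + i * 8.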
I would expect any efficient byte-gathering operations expressed in CUDA code at the HLL level to be mapped to a mix of the operations enumerated above.
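To illustrate what those building blocks do, here is a functional model (an emulation in plain C, not the hardware implementation) of a LOP3-style truth-table operation and of PRMT-style byte permutation in the spirit of CUDA's __byte_perm() (default mode, without sign replication):

```c
#include <stdint.h>

/* LOP3-style operation: bit i of the result is bit
   (a_i*4 + b_i*2 + c_i) of the 8-bit truth table 'lut'. For example,
   lut = 0xEA encodes (a & b) | c. */
static uint32_t lop3_emul (uint32_t a, uint32_t b, uint32_t c, uint8_t lut)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        unsigned idx = (((a >> i) & 1) << 2) |
                       (((b >> i) & 1) << 1) |
                        ((c >> i) & 1);
        r |= (uint32_t)((lut >> idx) & 1) << i;
    }
    return r;
}

/* PRMT-style byte permutation: 'a' and 'b' form an eight-byte pool,
   and each selector nibble of 's' picks one byte of the result. */
static uint32_t prmt_emul (uint32_t a, uint32_t b, uint32_t s)
{
    uint64_t pool = ((uint64_t)b << 32) | a;
    uint32_t r = 0;
    for (int i = 0; i < 4; i++) {
        unsigned sel = (s >> (4 * i)) & 7;   /* byte index 0..7 */
        r |= (uint32_t)((pool >> (8 * sel)) & 0xFF) << (8 * i);
    }
    return r;
}
```

For instance, prmt_emul (x, 0, 0x0123) reverses the bytes of x.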
x86 has limited scale-and-add capability in the form of the LEA instruction, but obviously the available scale factors (1x, 2x, 4x, 8x) are not directly sufficient for byte manipulation. PowerPC processors have a versatile rlwimi instruction (rotate left word immediate then mask insert), which compilers do an excellent job of utilizing for byte insert and extract operations.
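A functional sketch of what rlwimi does, with one simplification of mine: a precomputed mask stands in for the instruction's MB/ME bit-range fields.

```c
#include <stdint.h>

/* Rotate left, 32-bit. */
static uint32_t rotl32 (uint32_t x, unsigned n)
{
    n &= 31;
    return n ? ((x << n) | (x >> (32 - n))) : x;
}

/* rlwimi-style insert: rotate 'rs' left by 'sh', then insert the bits
   selected by 'mask' into 'ra', leaving the other bits of 'ra'
   unchanged. One instruction on PowerPC, four operations here. */
static uint32_t rlwimi_emul (uint32_t ra, uint32_t rs, unsigned sh,
                             uint32_t mask)
{
    return (rotl32 (rs, sh) & mask) | (ra & ~mask);
}
```

For example, rlwimi_emul (0xAABBCCDD, 0x000000EE, 16, 0x00FF0000) inserts the low byte of the second operand into byte 2 of the first, giving 0xAAEECCDD.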
The last x86 processor architecture I helped create was the AMD Athlon processor (so 20+ years ago). While it used a mechanism that translated x86 instructions into an internal instruction set, there weren't any clever optimizations that combined internal operations across x86 instruction boundaries. At least not that I recall. I don't know what modern x86 processors can do.