LOP3 Throughput

As LOP3 doesn’t appear as a specific entry in Table 2: Throughput of Native Arithmetic Instructions (Number of Results per Clock Cycle per Multiprocessor) in the Programming Guide here:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/#maximize-instruction-throughput

is it fair to assume that the throughput of the LOP3 instruction is the same as for “32-bit bitwise AND, OR, XOR”?

Thanks.

LOP3 is a versatile instruction that can implement any 3-input bitwise logical operation, selected by an 8-bit lookup table (LUT). It has long been the building block of bitwise logical operations. AFAIK, since Maxwell, all bitwise AND, OR, and XOR operations are mapped to LOP3 with the appropriate LUT, so the throughput listed for “32-bit bitwise AND, OR, XOR” should apply to it as well. It is among the fastest instructions, with maximum instruction throughput and lowest latency, the same as FFMA.
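To make the LUT idea concrete, here is a minimal host-side C++ sketch that models what a 3-input LUT operation computes per bit. It is not the GPU instruction itself, and the names `lop3_model`, `kLutAnd`, etc. are made up for illustration. The LUT values follow the convention described for `lop3` in the PTX ISA documentation: apply the desired expression to the constants 0xF0, 0xCC and 0xAA (standing for a, b and c).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper names; this is a host-side model, not the GPU instruction.

// Per-bit model of a 3-input LUT operation: at each bit position the bits of
// a, b and c form a 3-bit index (a is the most significant), which selects
// one bit of the 8-bit table.
inline std::uint32_t lop3_model(std::uint32_t a, std::uint32_t b,
                                std::uint32_t c, std::uint8_t lut) {
    std::uint32_t result = 0;
    for (int bit = 0; bit < 32; ++bit) {
        const std::uint32_t idx = (((a >> bit) & 1u) << 2) |
                                  (((b >> bit) & 1u) << 1) |
                                  ((c >> bit) & 1u);
        result |= ((static_cast<std::uint32_t>(lut) >> idx) & 1u) << bit;
    }
    return result;
}

// LUT constants: evaluate the desired expression on the truth-table
// values 0xF0 (a), 0xCC (b), 0xAA (c).
constexpr std::uint8_t kA = 0xF0, kB = 0xCC, kC = 0xAA;
constexpr std::uint8_t kLutAnd  = kA & kB;        // a & b      (c ignored)
constexpr std::uint8_t kLutOr   = kA | kB;        // a | b      (c ignored)
constexpr std::uint8_t kLutXor  = kA ^ kB;        // a ^ b      (c ignored)
constexpr std::uint8_t kLutXor3 = kA ^ kB ^ kC;   // a ^ b ^ c
constexpr std::uint8_t kLutAndOr = (kA & kB) | kC; // (a & b) | c

// Sanity check: the model agrees with the ordinary C++ operators.
inline void check_lop3_model() {
    const std::uint32_t a = 0x12345678u, b = 0x0F0F0F0Fu, c = 0xDEADBEEFu;
    assert(lop3_model(a, b, c, kLutAnd)   == (a & b));
    assert(lop3_model(a, b, c, kLutOr)    == (a | b));
    assert(lop3_model(a, b, c, kLutXor)   == (a ^ b));
    assert(lop3_model(a, b, c, kLutXor3)  == (a ^ b ^ c));
    assert(lop3_model(a, b, c, kLutAndOr) == ((a & b) | c));
}
```

The point of the sketch is that a two-input AND, OR or XOR is just a three-input LUT operation whose table ignores the third input, which is why a single instruction can cover all of them, and why an expression like `(a & b) | c` can in principle be handled in one operation rather than two.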