LOP3 Throughput

rs27 · July 26, 2019, 12:22am

LOP3 doesn’t appear as a specific entry in Table 2, “Throughput of Native Arithmetic Instructions (Number of Results per Clock Cycle per Multiprocessor)”, in the Programming Guide here:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/#maximize-instruction-throughput

Is it fair to assume that the throughput of the LOP3 instruction is the same as that listed for “32-bit bitwise AND, OR, XOR”?

Thanks.

xgr_1986 · July 26, 2019, 2:12pm

LOP3 is a versatile instruction that can implement any 3-input bitwise logical operation. It has long been the building block of bitwise logical operations. AFAIK, since Maxwell, all bitwise AND, OR, and XOR operations are mapped to LOP3 with an appropriate LUT. It is among the fastest instructions, with maximum instruction throughput and the lowest latency, the same as FFMA.
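To make the "proper LUT" part concrete, here is a minimal host-side sketch of how an 8-bit lookup table selects a 3-input bitwise function. It follows the convention in the PTX ISA description of `lop3.b32 d, a, b, c, immLut;`, where the immediate is obtained by applying the desired function to the constants 0xF0, 0xCC and 0xAA. The names `lop3_emulate`, `verify_luts` and the `LUT_*` constants are made up for illustration; this is a software model of the semantics, not what the compiler emits, and it says nothing about throughput.

```cpp
#include <cassert>
#include <cstdint>

// Truth-table columns for inputs a, b, c (PTX ISA convention for lop3 immLut).
constexpr std::uint8_t TA = 0xF0;
constexpr std::uint8_t TB = 0xCC;
constexpr std::uint8_t TC = 0xAA;

// The LUT for a function F is simply F applied to the three column constants.
constexpr std::uint8_t LUT_AND2 = TA & TB;                           // a & b
constexpr std::uint8_t LUT_OR2  = TA | TB;                           // a | b
constexpr std::uint8_t LUT_XOR3 = TA ^ TB ^ TC;                      // a ^ b ^ c
constexpr std::uint8_t LUT_MAJ3 = (TA & TB) | (TA & TC) | (TB & TC); // majority

static_assert(LUT_AND2 == 0xC0, "a & b");
static_assert(LUT_OR2  == 0xFC, "a | b");
static_assert(LUT_XOR3 == 0x96, "a ^ b ^ c");
static_assert(LUT_MAJ3 == 0xE8, "majority of a, b, c");

// Hypothetical software model of LOP3.LUT: for every bit position, the three
// input bits form an index (a:b:c) into the 8-entry truth table `lut`.
inline std::uint32_t lop3_emulate(std::uint32_t a, std::uint32_t b,
                                  std::uint32_t c, std::uint8_t lut) {
    std::uint32_t result = 0;
    for (int bit = 0; bit < 32; ++bit) {
        const std::uint32_t idx = (((a >> bit) & 1u) << 2) |
                                  (((b >> bit) & 1u) << 1) |
                                  ((c >> bit) & 1u);
        result |= ((static_cast<std::uint32_t>(lut) >> idx) & 1u) << bit;
    }
    return result;
}

// Checks the model against the native operators on a few sample words.
inline void verify_luts() {
    const std::uint32_t a = 0xDEADBEEFu, b = 0x12345678u, c = 0x0F0F0F0Fu;
    assert(lop3_emulate(a, b, c, LUT_AND2) == (a & b));
    assert(lop3_emulate(a, b, c, LUT_OR2)  == (a | b));
    assert(lop3_emulate(a, b, c, LUT_XOR3) == (a ^ b ^ c));
    assert(lop3_emulate(a, b, c, LUT_MAJ3) == ((a & b) | (a & c) | (b & c)));
}
```

With two-input operations the third operand is simply ignored by the table, which is consistent with plain AND, OR and XOR being expressible as LOP3 with different LUT values.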
