What does LOP3.LUT mean? How is it executed?

I see a lot of this instruction in SASS code.
for example:

            LOP3.LUT R45, R42, R19, RZ, 0x3c, !PT 

It seems the R45 register is output and the others are input.
How the output is calculated?

Its SASS, so its not well documented. However you can get some insight by studying the corresponding PTX instruction. This may be of interest.

Thanks a lot.

Another question,
Because ALU is very busy(98% utilized), I want to move some operation from ALU to FMA.
In my code, one heat point is:

a = a + b + c

each variable is an unsigned int.
The compiler use IADD3 to calculate the sum which is executed in ALU. I want to use two IMAD instruction to calculate the sum so that some work can be moved from ALU to FMA. I wrote ptx as below:

__device__ uint32_t add(uint32_t a, uint32_t b, uint32_t c){
    uint32_t ret = 0;
    asm("mad.lo.u32 %0, %1, 1, %2;"
        "mad.lo.u32 %0, %0, 1, %3;"
        :"+r"(ret):"r"(a), "r"(b), "r"(c));
    return ret;
}

However, the add() function was still optimized to use single IADD3 instruction. How can I force two mad instruction be used?

I donā€™t know how to force the compiler to generate IMAD instead of IADD3. I guess you could try various ptxas optimization levels but that is just likely to crater your performance. Another possibility might be to pass the multiplier of 1 in such a way (e.g. via a register) that the compiler cannot discover it as a compile-time constant:

__device__ uint32_t add(uint32_t a, uint32_t b, uint32_t c, uint32_t m){
    uint32_t ret = 0;
    asm("mad.lo.u32 %0, %1, %4, %2;"
        "mad.lo.u32 %0, %0, %4, %3;"
        :"+r"(ret):"r"(a), "r"(b), "r"(c), "r"(m));
    return ret;
}

however its going to take more than just that change. You might have to really obfuscate the 1 value (like, pass it as a kernel argument) or else use rdc device-code linking, and keep the constant 1 in a separate compilation unit from the above function.

Later:

compiling the above code for cc8.6, and ā€œhidingā€ the 1 multiplier, can result in two IMAD instructions:

$ cat t2107.cu
#include <cstdint>
__device__ uint32_t add(uint32_t a, uint32_t b, uint32_t c, uint32_t m){
    uint32_t ret = 0;
    asm("mad.lo.u32 %0, %1, %4, %2;"
        "mad.lo.u32 %0, %0, %4, %3;"
        :"+r"(ret):"r"(a), "r"(b), "r"(c), "r"(m));
    return ret;
}

__global__ void k(uint32_t a, uint32_t b, uint32_t c, uint32_t m, uint32_t *r){

  *r = add(a, b, c, m);
}


$ nvcc -c t2107.cu -arch=sm_86
$ cuobjdump -sass t2107.o

Fatbin elf code:
================
arch = sm_86
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

        code for sm_86
                Function : _Z1kjjjjPj
        .headerflags    @"EF_CUDA_SM86 EF_CUDA_PTX_SM(EF_CUDA_SM86)"
        /*0000*/                   MOV R1, c[0x0][0x28] ;            /* 0x00000a0000017a02 */
                                                                     /* 0x000fc40000000f00 */
        /*0010*/                   MOV R5, c[0x0][0x16c] ;           /* 0x00005b0000057a02 */
                                                                     /* 0x000fe20000000f00 */
        /*0020*/                   ULDC.64 UR4, c[0x0][0x118] ;      /* 0x0000460000047ab9 */
                                                                     /* 0x000fe20000000a00 */
        /*0030*/                   MOV R0, c[0x0][0x160] ;           /* 0x0000580000007a02 */
                                                                     /* 0x000fe40000000f00 */
        /*0040*/                   MOV R2, c[0x0][0x170] ;           /* 0x00005c0000027a02 */
                                                                     /* 0x000fe40000000f00 */
        /*0050*/                   MOV R3, c[0x0][0x174] ;           /* 0x00005d0000037a02 */
                                                                     /* 0x000fe20000000f00 */
        /*0060*/                   IMAD R0, R5, R0, c[0x0][0x164] ;  /* 0x0000590005007624 */
                                                                     /* 0x000fc800078e0200 */
        /*0070*/                   IMAD R5, R0, R5, c[0x0][0x168] ;  /* 0x00005a0000057624 */
                                                                     /* 0x000fca00078e0205 */
        /*0080*/                   STG.E [R2.64], R5 ;               /* 0x0000000502007986 */
                                                                     /* 0x000fe2000c101904 */
        /*0090*/                   EXIT ;                            /* 0x000000000000794d */
                                                                     /* 0x000fea0003800000 */
        /*00a0*/                   BRA 0xa0;                         /* 0xfffffff000007947 */
                                                                     /* 0x000fc0000383ffff */
        /*00b0*/                   NOP;                              /* 0x0000000000007918 */
                                                                     /* 0x000fc00000000000 */
        /*00c0*/                   NOP;                              /* 0x0000000000007918 */
                                                                     /* 0x000fc00000000000 */
        /*00d0*/                   NOP;                              /* 0x0000000000007918 */
                                                                     /* 0x000fc00000000000 */
        /*00e0*/                   NOP;                              /* 0x0000000000007918 */
                                                                     /* 0x000fc00000000000 */
        /*00f0*/                   NOP;                              /* 0x0000000000007918 */
                                                                     /* 0x000fc00000000000 */
        /*0100*/                   NOP;                              /* 0x0000000000007918 */
                                                                     /* 0x000fc00000000000 */
        /*0110*/                   NOP;                              /* 0x0000000000007918 */
                                                                     /* 0x000fc00000000000 */
        /*0120*/                   NOP;                              /* 0x0000000000007918 */
                                                                     /* 0x000fc00000000000 */
        /*0130*/                   NOP;                              /* 0x0000000000007918 */
                                                                     /* 0x000fc00000000000 */
        /*0140*/                   NOP;                              /* 0x0000000000007918 */
                                                                     /* 0x000fc00000000000 */
        /*0150*/                   NOP;                              /* 0x0000000000007918 */
                                                                     /* 0x000fc00000000000 */
        /*0160*/                   NOP;                              /* 0x0000000000007918 */
                                                                     /* 0x000fc00000000000 */
        /*0170*/                   NOP;                              /* 0x0000000000007918 */
                                                                     /* 0x000fc00000000000 */
                ..........



Fatbin ptx code:
================
arch = sm_86
code version = [7,4]
producer = <unknown>
host = linux
compile_size = 64bit
compressed
$
2 Likes

LOP3.LUT is an instruction that can compute any logical operation with three inputs, using a lookup table (LUT) to do so. The basic idea is similar to that of FPGAs using complex building blocks (CLBs), some of which can compute any logical operation of up to five inputs.

The introduction of LOP3 was a smart move on the part of NVIDIA by taking full advantage of a data path already designed for three-input operations, as it has to support FMA operations. Therefore also the existence of IADD3, which is only incrementally harder to implement than regular addition (simply add a carry-save adder in front of the carry-propagate adder), and the prevalence of IMAD instructions in code generated for Ampere.

The 8-bit lookup table (truth table) for LOP.LUT3 is specified in the instruction encoding itself, and is not difficult to reverse engineer. While LOP3.LUT is great for performance, and the CUDA compiler does a pretty good (but often not perfect where there is a sequences of LOP3s) job of exploiting it, it makes reading SASS (machine code, e.g. output of cuobjdump --dump-sass) completely non-intuitive and a major pain in the behind.

Here are two LOP3 emulations:

/* emulate GPU's LOP3.LUT (three-input logic op with 8-bit truth table) */
uint32_t lop3_ref (uint32_t a, uint32_t b, uint32_t c, uint8_t ttbl)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        uint32_t bitidx = ((a & 1) << 2) | ((b & 1) << 1) | (c & 1);
        uint32_t tblbit = (ttbl >> bitidx) & 1;
        r = r | (tblbit << i);
        a >>= 1;
        b >>= 1;
        c >>= 1;
    }
    return r;
}

/* emulate GPU's LOP3.LUT (three-input logic op with 8-bit truth table) */
uint32_t lop3_fast (uint32_t a, uint32_t b, uint32_t c, uint8_t ttbl)
{
    uint32_t r = 0;
    if (ttbl & 0x01) r |= ~a & ~b & ~c;
    if (ttbl & 0x02) r |= ~a & ~b &  c;
    if (ttbl & 0x04) r |= ~a &  b & ~c;
    if (ttbl & 0x08) r |= ~a &  b &  c;
    if (ttbl & 0x10) r |=  a & ~b & ~c;
    if (ttbl & 0x20) r |=  a & ~b &  c;
    if (ttbl & 0x40) r |=  a &  b & ~c;
    if (ttbl & 0x80) r |=  a &  b &  c;
    return r;
}
2 Likes

As for the quest to replace one IADD3 by two IMADs, the fact that ptxas transforms the latter into the former is a pretty good indication that this is not a performance win as hoped for. ptxas is not generally shy to use IMAD, which appears to be a high-throughput instruction in Ampere family GPUs.

Please note that, despite the name, ptxas is an optimizing compiler, and trying to coerce it into generating any particular SASS by manipulating PTX code is extremely brittle and likely doomed to failure.

It works. Thanks a lot.

As Robert suggested, I moved calculation from ALU to FMU and increased performance a lot. In most cases, using two IMAD instructions to calculate sum of three integers is worse than using single IADD3 instruction, so the compiler always try to using single IADD3 instruction. However in my case, as ALU is too busy, itā€™s beneficial to use two FMU instruction instead of single ALU instruction. Thanks for your response.

1 Like

Thanks njuffa, the explanation and emulate of LOP3 here is very clear. But these codes only explain what the first four operands do, I wonder what the fifth operand is used to control. I came across two fifth operands when I examined the result of the S32 divide assembly, here is my kernel:

#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <cuda.h>

__global__ void test_2_para_int(int x, int y, int *output) {
	*output = x / y;
}

int main() {
	using namespace std;
	int x, y;
	x= 0x7f000000;
	y= 0xff7fffff;

	int* d_output;
	cudaMalloc((int**)&d_output, sizeof(int));
	test_2_para_int<<<1, 1>>>(x, y, d_output);
	int output;
	cudaMemcpy(&output, d_output, sizeof(int), cudaMemcpyDeviceToHost);
	cout << "output: " << output << endl;
}

and below is the assembly code by sm_86, I saw both ULOP3.LUT UR4, UR4, UR5, URZ, 0x3c, !UPT and LOP3.LUT R5, RZ, c[0x0][0x164], RZ, 0x33, !PT in my result, I tried to analyze it with the code you gave, in this case I got input with $UR4=0x7f000000, $UR5=0x800001, the hand calculation should be 0x7f800001, but I got 0x807fffff under cuda-gdb, I wonder if the fifth operand has a ā€˜negateā€™ control.

	code for sm_86
		Function : _Z15test_2_para_intiiPi
	.headerflags    @"EF_CUDA_SM86 EF_CUDA_PTX_SM(EF_CUDA_SM86)"
        /*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;               /* 0x00000a00ff017624 */
                                                                                         /* 0x000fc400078e00ff */
        /*0010*/                   IABS R5, c[0x0][0x164] ;                              /* 0x0000590000057a13 */
                                                                                         /* 0x000fe20000000000 */
        /*0020*/                   ULDC.64 UR4, c[0x0][0x160] ;                          /* 0x0000580000047ab9 */
                                                                                         /* 0x000fe40000000a00 */
        /*0030*/                   ULOP3.LUT UR4, UR4, UR5, URZ, 0x3c, !UPT ;            /* 0x0000000504047292 */
                                                                                         /* 0x000fe2000f8e3c3f */
        /*0040*/                   I2F.RP R0, R5 ;                                       /* 0x0000000500007306 */
                                                                                         /* 0x000e2a0000209400 */
        /*0050*/                   ISETP.LE.AND P1, PT, RZ, UR4, PT ;                    /* 0x00000004ff007c0c */
                                                                                         /* 0x000fe2000bf23270 */
        /*0060*/                   ULDC.64 UR4, c[0x0][0x118] ;                          /* 0x0000460000047ab9 */
                                                                                         /* 0x000fc40000000a00 */
        /*0070*/                   MUFU.RCP R0, R0 ;                                     /* 0x0000000000007308 */
                                                                                         /* 0x001e240000001000 */
        /*0080*/                   IADD3 R2, R0, 0xffffffe, RZ ;                         /* 0x0ffffffe00027810 */
                                                                                         /* 0x001fcc0007ffe0ff */
        /*0090*/                   F2I.FTZ.U32.TRUNC.NTZ R3, R2 ;                        /* 0x0000000200037305 */
                                                                                         /* 0x000064000021f000 */
        /*00a0*/                   IMAD.MOV.U32 R2, RZ, RZ, RZ ;                         /* 0x000000ffff027224 */
                                                                                         /* 0x001fe400078e00ff */
        /*00b0*/                   IMAD.MOV R4, RZ, RZ, -R3 ;                            /* 0x000000ffff047224 */
                                                                                         /* 0x002fc800078e0a03 */
        /*00c0*/                   IMAD R7, R4, R5, RZ ;                                 /* 0x0000000504077224 */
                                                                                         /* 0x000fe200078e02ff */
        /*00d0*/                   IABS R4, c[0x0][0x160] ;                              /* 0x0000580000047a13 */
                                                                                         /* 0x000fc60000000000 */
        /*00e0*/                   IMAD.HI.U32 R3, R3, R7, R2 ;                          /* 0x0000000703037227 */
                                                                                         /* 0x000fc800078e0002 */
        /*00f0*/                   IMAD.MOV.U32 R2, RZ, RZ, c[0x0][0x168] ;              /* 0x00005a00ff027624 */
                                                                                         /* 0x000fe400078e00ff */
        /*0100*/                   IMAD.HI.U32 R3, R3, R4, RZ ;                          /* 0x0000000403037227 */
                                                                                         /* 0x000fca00078e00ff */
        /*0110*/                   IADD3 R0, -R3, RZ, RZ ;                               /* 0x000000ff03007210 */
                                                                                         /* 0x000fca0007ffe1ff */
        /*0120*/                   IMAD R0, R5, R0, R4 ;                                 /* 0x0000000005007224 */
                                                                                         /* 0x000fca00078e0204 */
        /*0130*/                   ISETP.GT.U32.AND P2, PT, R5, R0, PT ;                 /* 0x000000000500720c */
                                                                                         /* 0x000fda0003f44070 */
        /*0140*/              @!P2 IMAD.IADD R0, R0, 0x1, -R5 ;                          /* 0x000000010000a824 */
                                                                                         /* 0x000fe200078e0a05 */
        /*0150*/              @!P2 IADD3 R3, R3, 0x1, RZ ;                               /* 0x000000010303a810 */
                                                                                         /* 0x000fe40007ffe0ff */
        /*0160*/                   ISETP.NE.AND P2, PT, RZ, c[0x0][0x164], PT ;          /* 0x00005900ff007a0c */
                                                                                         /* 0x000fe40003f45270 */
        /*0170*/                   ISETP.GE.U32.AND P0, PT, R0, R5, PT ;                 /* 0x000000050000720c */
                                                                                         /* 0x000fda0003f06070 */
        /*0180*/               @P0 IADD3 R3, R3, 0x1, RZ ;                               /* 0x0000000103030810 */
                                                                                         /* 0x000fca0007ffe0ff */
        /*0190*/                   IMAD.MOV.U32 R5, RZ, RZ, R3 ;                         /* 0x000000ffff057224 */
                                                                                         /* 0x000fe200078e0003 */
        /*01a0*/                   MOV R3, c[0x0][0x16c] ;                               /* 0x00005b0000037a02 */
                                                                                         /* 0x000fc80000000f00 */
        /*01b0*/              @!P1 IADD3 R5, -R5, RZ, RZ ;                               /* 0x000000ff05059210 */
                                                                                         /* 0x000fe40007ffe1ff */
        /*01c0*/              @!P2 LOP3.LUT R5, RZ, c[0x0][0x164], RZ, 0x33, !PT ;       /* 0x00005900ff05aa12 */
                                                                                         /* 0x000fca00078e33ff */
        /*01d0*/                   STG.E [R2.64], R5 ;                                   /* 0x0000000502007986 */
                                                                                         /* 0x000fe2000c101904 */
        /*01e0*/                   EXIT ;                                                /* 0x000000000000794d */
                                                                                         /* 0x000fea0003800000 */
        /*01f0*/                   BRA 0x1f0;                                            /* 0xfffffff000007947 */
                                                                                         /* 0x000fc0000383ffff */

I have not paid attention to this predicate usage before. Maybe it is specific to certain architectures? Also the way the disassembler seems to work it does not display default settings. I guess for LOP3 the default setting is PT, meaning the predicate only gets displayed in disassembly when it is !PT. Are you observing this only with ULOP3 or also with plain LOP3?

A control to invert (oneā€™s complement) the result seems unlikely as that would be functionally redundant: the basic functionality of LOP3 can implement any logic function of three inputs, with NOTs on any inputs or the output as desired.

One would have to look at more examples of this predicate usage to reverse engineer it with confidence. Your experiment with single-stepping through the code is straightforward enough, but the finding (predicate controls inversion of the output) does not make sense to me.

There must be more to this story. Highly speculative: could the conditional inversion be a power-saving feature depending on the 8-bit truth-table encoding? In a similar vein: Could it trigger power saving in the data path if the instruction is just used for a two-input logic operation, by bypassing (parts of) the table lookup mechanism?

I only tried ULOP3 calculation here, cause in my case P2=0x1 and the LOP3 instruction is not actually evaluated. I canā€™t tell the difference in behavior between ULOP3 and LOP3 from the current documentation.

the basic functionality of LOP3 can implement any logic function of three inputs, with NOTs on any inputs or the output as desired

This is truly exact, maybe I need to try more lop3 application by ptx isa launch, thanks for your help!

The U prefix designates the ā€œuniformā€ datapath and associated register set. This has come up in this subforum before:

1 Like

In looking at instances of LOP3 and ULOP3 in other codes, I see only !UPT and !PT variants in the disassembly. I have not yet found a case of a {U}LOP3 with a predicate PT or UPT.

Thatā€™s true, and I saw nvidia support .BoolOp after ptx version 8.2, I have an A10 card with cuda version 12.3, but when I wrote asm("lop3.or.b32 %0, %1, %2, %3, %4, %3;" : "=r"(*output) : "r"(x), "r"(y), "r"(z), "n"(0x3c)); will get compile error like Arguments mismatch for instruction 'lop3'. I think I wrote the wrong return type(should return bool not integer, but I forgot how to set bool parameter in inline ptx), and perhaps this fifth operand is used under this condition.

I had not noticed the addition of .BoolOp to the PTX instruction lop3 in recent version of PTX. Thanks for pointing it out. It makes sense that the q predicate for the .BoolOp would correspond to the final operand in the SASS instruction LOP3. Presumably !PT (a.k.a. FALSE) means that no predicate is being written?

I am bit confused on the semantics of the .BoolOp version. According to the syntax (I consulted at version 8.3), the variant with .BoolOp can write a general register or a predicate. The first choice doesnā€™t seem to make sense?

To my knowledge, there is no binding for predicates in inline PTX asm() statements. If you want to use the .BoolOp variant delivering a predicate, you need to convert that predicate into a variable of a type than can be bound. If I have some time, I will experiment a bit with this tomorrow.

After reading the complete section on lop3 in the PTX manual, it is clear that lop3 with a .BoolOp can write either a destination register plus a predicate or just a predicate. The vertical bar in the syntax does not express choice (either general register or predicate) but is actually a separator that separates the two destinations. Not clear why they did not use the (dest1, dest2) syntax used elsewhere when two destinations need to be specified. From the generated SASS it is clear that the disassembler treats POR as the default mode, and PAND as the alternate mode when .BoolOp is used.

Here is the little program I created for my exploration:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

__global__ void kernel (uint32_t x, uint32_t y, uint32_t z, int pred_in)
{
    uint32_t dest;
    int pred_out;
    asm ("\n\t"
         ".reg .pred p, q;\n\t"
         ".reg .u32 a, b, c, d, t;\n\t"
         "mov.b32      a, %2;\n\t"
         "mov.b32      b, %3;\n\t"
         "mov.b32      c, %4;\n\t"
         "mov.b32      t, %5;\n\t"
         "setp.ne.b32  q, t, 0;\n\t"
         "lop3.or.b32  d | p, a, b, c, 0x80, q;\n\t" /*a & b & c*/
         "mov.b32      %0, d;\n\t"
         "selp.u32     %1, 1, 0, p;\n\t"
         : "=r"(dest), "=r"(pred_out) : "r"(x), "r"(y), "r"(z), "r"(pred_in));
    printf ("GPU: a=%08x b=%08x c=%08x pred_in=%d d=%08x pred_out=%d\n",
            x, y, z, pred_in, dest, pred_out);
}

int main (void)
{
    uint32_t x = 0x55555555;
    uint32_t y = 0x33333333;
    uint32_t z = 0xff00ff00;
    int pred_in = 0;
    kernel<<<1,1>>>(x, y, z, pred_in);
    cudaDeviceSynchronize();
    return EXIT_SUCCESS;
}
1 Like

Seems the only way to define a pred register is using multi-line ptx by .reg .pred instead of single-line ptx.
I tried your code at my plantform and found /*0080*/ LOP3.LUT P0, R2, R8, c[0x0][0x164], R10, 0x80, P0 ; /* 0x0000590008027a12 */

From the generated SASS it is clear that the disassembler treats POR as the default mode, and PAND as the alternate mode when .BoolOp is used.

I saw the same situation and found the fifth operand in new assembly result is P0 defined earlier instead of PT. I think nv has done some functionality extension here for lop3 and is compatible with older implementations.
Although it doesnā€™t completely solve my problem during studying the fast s32 div algorithm by f32 rcp, but let me be more clear about the operand rules for lop3, thanks again and wish you have a nice day :)

The PTX manual says the .BoolOp variant was introduced with sm_70, whereas lop3 was introduced with sm_50 if I recall correctly. However, the .BoolOp variant was not exposed at the PTX level until recently, presumably after NVIDIA decided it was going to support this functionality in hardware indefinitely. The issue with exposing HW functionality in PTX too quickly is that if hardware support is removed, you now have to provide emulation going forward. That has already happened several times in the past, and it constitutes an ongoing cost.

The int division you are looking at does not use the .BoolOp variant of lop3. Presumably all this initial lop3 does is XOR the two source operands to establish the sign of the result. Then it proceeds to map s32 division to u32 division by taking the absolute value of dividend and remainder.

The integer division emulation for CUDA has gone through several versions over the years, several of which I created while I was at NVIDIA. The basic idea is to get a reciprocal estimate via MUFU.RCP, then perform Newton or Halley iteration(s) to refine it, back-multiply with the dividend to get a preliminary quotient, determine the corresponding remainder and finalize the quotient based on that. I would assume this basic structure is still in place, but NVIDIA engineers have probably tweaked and tuned the details since I retired from the company in 2014.

Integer division in CUDA is actually not particularly fast, as there is no dedicated hardware of any kind to accelerate it and everything needs to be emulated using ordinary instructions. The existence of MUFU.RCP makes performance acceptable: the performance is OK for 32-bit divisions, but a bit ā€œmeh!ā€ for 64-bit divisions.

1 Like

Thatā€™s right and generally speaking, the hw resources of the divider are truly expensive. Mostly the division algorithm is getting an approximate solution and using iteration to fix the precision. I just wonder how to determine the number of iterations and how to ensure the stability of the division result as the accuracy of MUFU.RCP precision changes.
Perhaps both of these points can be guaranteed by error analysis, I need to refer to more related books.