What does LOP3.LUT mean? How is it executed?

I see a lot of this instruction in SASS code.
for example:

            LOP3.LUT R45, R42, R19, RZ, 0x3c, !PT 

It seems the R45 register is output and the others are input.
How the output is calculated?

Its SASS, so its not well documented. However you can get some insight by studying the corresponding PTX instruction. This may be of interest.

Thanks a lot.

Another question,
Because ALU is very busy(98% utilized), I want to move some operation from ALU to FMA.
In my code, one heat point is:

a = a + b + c

each variable is an unsigned int.
The compiler use IADD3 to calculate the sum which is executed in ALU. I want to use two IMAD instruction to calculate the sum so that some work can be moved from ALU to FMA. I wrote ptx as below:

__device__ uint32_t add(uint32_t a, uint32_t b, uint32_t c){
    uint32_t ret = 0;
    asm("mad.lo.u32 %0, %1, 1, %2;"
        "mad.lo.u32 %0, %0, 1, %3;"
        :"+r"(ret):"r"(a), "r"(b), "r"(c));
    return ret;
}

However, the add() function was still optimized to use single IADD3 instruction. How can I force two mad instruction be used?

I don’t know how to force the compiler to generate IMAD instead of IADD3. I guess you could try various ptxas optimization levels but that is just likely to crater your performance. Another possibility might be to pass the multiplier of 1 in such a way (e.g. via a register) that the compiler cannot discover it as a compile-time constant:

__device__ uint32_t add(uint32_t a, uint32_t b, uint32_t c, uint32_t m){
    uint32_t ret = 0;
    asm("mad.lo.u32 %0, %1, %4, %2;"
        "mad.lo.u32 %0, %0, %4, %3;"
        :"+r"(ret):"r"(a), "r"(b), "r"(c), "r"(m));
    return ret;
}

however its going to take more than just that change. You might have to really obfuscate the 1 value (like, pass it as a kernel argument) or else use rdc device-code linking, and keep the constant 1 in a separate compilation unit from the above function.

Later:

compiling the above code for cc8.6, and “hiding” the 1 multiplier, can result in two IMAD instructions:

$ cat t2107.cu
#include <cstdint>
__device__ uint32_t add(uint32_t a, uint32_t b, uint32_t c, uint32_t m){
    uint32_t ret = 0;
    asm("mad.lo.u32 %0, %1, %4, %2;"
        "mad.lo.u32 %0, %0, %4, %3;"
        :"+r"(ret):"r"(a), "r"(b), "r"(c), "r"(m));
    return ret;
}

__global__ void k(uint32_t a, uint32_t b, uint32_t c, uint32_t m, uint32_t *r){

  *r = add(a, b, c, m);
}


$ nvcc -c t2107.cu -arch=sm_86
$ cuobjdump -sass t2107.o

Fatbin elf code:
================
arch = sm_86
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

        code for sm_86
                Function : _Z1kjjjjPj
        .headerflags    @"EF_CUDA_SM86 EF_CUDA_PTX_SM(EF_CUDA_SM86)"
        /*0000*/                   MOV R1, c[0x0][0x28] ;            /* 0x00000a0000017a02 */
                                                                     /* 0x000fc40000000f00 */
        /*0010*/                   MOV R5, c[0x0][0x16c] ;           /* 0x00005b0000057a02 */
                                                                     /* 0x000fe20000000f00 */
        /*0020*/                   ULDC.64 UR4, c[0x0][0x118] ;      /* 0x0000460000047ab9 */
                                                                     /* 0x000fe20000000a00 */
        /*0030*/                   MOV R0, c[0x0][0x160] ;           /* 0x0000580000007a02 */
                                                                     /* 0x000fe40000000f00 */
        /*0040*/                   MOV R2, c[0x0][0x170] ;           /* 0x00005c0000027a02 */
                                                                     /* 0x000fe40000000f00 */
        /*0050*/                   MOV R3, c[0x0][0x174] ;           /* 0x00005d0000037a02 */
                                                                     /* 0x000fe20000000f00 */
        /*0060*/                   IMAD R0, R5, R0, c[0x0][0x164] ;  /* 0x0000590005007624 */
                                                                     /* 0x000fc800078e0200 */
        /*0070*/                   IMAD R5, R0, R5, c[0x0][0x168] ;  /* 0x00005a0000057624 */
                                                                     /* 0x000fca00078e0205 */
        /*0080*/                   STG.E [R2.64], R5 ;               /* 0x0000000502007986 */
                                                                     /* 0x000fe2000c101904 */
        /*0090*/                   EXIT ;                            /* 0x000000000000794d */
                                                                     /* 0x000fea0003800000 */
        /*00a0*/                   BRA 0xa0;                         /* 0xfffffff000007947 */
                                                                     /* 0x000fc0000383ffff */
        /*00b0*/                   NOP;                              /* 0x0000000000007918 */
                                                                     /* 0x000fc00000000000 */
        /*00c0*/                   NOP;                              /* 0x0000000000007918 */
                                                                     /* 0x000fc00000000000 */
        /*00d0*/                   NOP;                              /* 0x0000000000007918 */
                                                                     /* 0x000fc00000000000 */
        /*00e0*/                   NOP;                              /* 0x0000000000007918 */
                                                                     /* 0x000fc00000000000 */
        /*00f0*/                   NOP;                              /* 0x0000000000007918 */
                                                                     /* 0x000fc00000000000 */
        /*0100*/                   NOP;                              /* 0x0000000000007918 */
                                                                     /* 0x000fc00000000000 */
        /*0110*/                   NOP;                              /* 0x0000000000007918 */
                                                                     /* 0x000fc00000000000 */
        /*0120*/                   NOP;                              /* 0x0000000000007918 */
                                                                     /* 0x000fc00000000000 */
        /*0130*/                   NOP;                              /* 0x0000000000007918 */
                                                                     /* 0x000fc00000000000 */
        /*0140*/                   NOP;                              /* 0x0000000000007918 */
                                                                     /* 0x000fc00000000000 */
        /*0150*/                   NOP;                              /* 0x0000000000007918 */
                                                                     /* 0x000fc00000000000 */
        /*0160*/                   NOP;                              /* 0x0000000000007918 */
                                                                     /* 0x000fc00000000000 */
        /*0170*/                   NOP;                              /* 0x0000000000007918 */
                                                                     /* 0x000fc00000000000 */
                ..........



Fatbin ptx code:
================
arch = sm_86
code version = [7,4]
producer = <unknown>
host = linux
compile_size = 64bit
compressed
$
1 Like

LOP3.LUT is an instruction that can compute any logical operation with three inputs, using a lookup table (LUT) to do so. The basic idea is similar to that of FPGAs using complex building blocks (CLBs), some of which can compute any logical operation of up to five inputs.

The introduction of LOP3 was a smart move on the part of NVIDIA by taking full advantage of a data path already designed for three-input operations, as it has to support FMA operations. Therefore also the existence of IADD3, which is only incrementally harder to implement than regular addition (simply add a carry-save adder in front of the carry-propagate adder), and the prevalence of IMAD instructions in code generated for Ampere.

The 8-bit lookup table (truth table) for LOP.LUT3 is specified in the instruction encoding itself, and is not difficult to reverse engineer. While LOP3.LUT is great for performance, and the CUDA compiler does a pretty good (but often not perfect where there is a sequences of LOP3s) job of exploiting it, it makes reading SASS (machine code, e.g. output of cuobjdump --dump-sass) completely non-intuitive and a major pain in the behind.

Here are two LOP3 emulations:

/* emulate GPU's LOP3.LUT (three-input logic op with 8-bit truth table) */
uint32_t lop3_ref (uint32_t a, uint32_t b, uint32_t c, uint8_t ttbl)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        uint32_t bitidx = ((a & 1) << 2) | ((b & 1) << 1) | (c & 1);
        uint32_t tblbit = (ttbl >> bitidx) & 1;
        r = r | (tblbit << i);
        a >>= 1;
        b >>= 1;
        c >>= 1;
    }
    return r;
}

/* emulate GPU's LOP3.LUT (three-input logic op with 8-bit truth table) */
uint32_t lop3_fast (uint32_t a, uint32_t b, uint32_t c, uint8_t ttbl)
{
    uint32_t r = 0;
    if (ttbl & 0x01) r |= ~a & ~b & ~c;
    if (ttbl & 0x02) r |= ~a & ~b &  c;
    if (ttbl & 0x04) r |= ~a &  b & ~c;
    if (ttbl & 0x08) r |= ~a &  b &  c;
    if (ttbl & 0x10) r |=  a & ~b & ~c;
    if (ttbl & 0x20) r |=  a & ~b &  c;
    if (ttbl & 0x40) r |=  a &  b & ~c;
    if (ttbl & 0x80) r |=  a &  b &  c;
    return r;
}
1 Like

As for the quest to replace one IADD3 by two IMADs, the fact that ptxas transforms the latter into the former is a pretty good indication that this is not a performance win as hoped for. ptxas is not generally shy to use IMAD, which appears to be a high-throughput instruction in Ampere family GPUs.

Please note that, despite the name, ptxas is an optimizing compiler, and trying to coerce it into generating any particular SASS by manipulating PTX code is extremely brittle and likely doomed to failure.

It works. Thanks a lot.

As Robert suggested, I moved calculation from ALU to FMU and increased performance a lot. In most cases, using two IMAD instructions to calculate sum of three integers is worse than using single IADD3 instruction, so the compiler always try to using single IADD3 instruction. However in my case, as ALU is too busy, it’s beneficial to use two FMU instruction instead of single ALU instruction. Thanks for your response.

1 Like