PTX source for __vsubss4 differs from what the manual shows

I checked the PTX source generated for __vsubss4 and it does not match the documentation. Do you know why?
The documentation lists “vsub4.s32”.
But the PTX source I generated shows “sub.u32” instead.

Referenced document: PTX ISA 8.5 (nvidia.com)

The source code, the command line, and the resulting PTX are below.

#include <stdio.h>

__global__ void hello(){
        unsigned int  a, b, c;

        a = 0x09080706;
        b = 0x01020304;

        c = __vsubss4(a, b);

        printf("result: %08x\n", c);
}

int main(){
        hello<<< 1, 1 >>>();
        cudaThreadSynchronize();

        return 0;
}
nvcc -ptx ./hello_test.cu -o test.ptx
//
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-33191640
// Cuda compilation tools, release 12.2, V12.2.140
// Based on NVVM 7.0.1
//

.version 8.2
.target sm_52
.address_size 64

        // .globl       _Z5hellov
.extern .func  (.param .b32 func_retval0) vprintf
(
        .param .b64 vprintf_param_0,
        .param .b64 vprintf_param_1
)
;
.global .align 1 .b8 $str[14] = {114, 101, 115, 117, 108, 116, 58, 32, 37, 48, 56, 120, 10};

.visible .entry _Z5hellov()
{
        .local .align 8 .b8     __local_depot0[8];
        .reg .b64       %SP;
        .reg .b64       %SPL;
        .reg .b32       %r<5>;
        .reg .b64       %rd<5>;


        mov.u64         %SPL, __local_depot0;
        cvta.local.u64  %SP, %SPL;
        add.u64         %rd1, %SP, 0;
        add.u64         %rd2, %SPL, 0;
        mov.u32         %r2, 151521030;
        mov.u32         %r3, 16909060;
        // begin inline asm
        {
        .reg .u32 a,b,r,s,t,u,v,w;
        mov.b32     a,%r2;
        mov.b32     b,%r3;
        not.b32     u,b;
        xor.b32     s,u,a;
        or.b32      r,a,0x80808080;
        and.b32     t,b,0x7f7f7f7f;
        sub.u32     r,r,t;
        xor.b32     t,r,a;
        not.b32     u,s;
        and.b32     s,s,0x80808080;
        xor.b32     r,r,s;
        and.b32     t,t,u;
        prmt.b32    s,a,0,0xba98;
        xor.b32     s,s,0x7f7f7f7f;
        prmt.b32    t,t,0,0xba98;
        and.b32     s,s,t;
        not.b32     t,t;
        and.b32     r,r,t;
        or.b32      r,r,s;
        mov.b32     %r1,r;
        }
        // end inline asm
        st.local.u32    [%rd2], %r1;
        mov.u64         %rd3, $str;
        cvta.global.u64         %rd4, %rd3;
        { // callseq 0, 0
        .reg .b32 temp_param_reg;
        .param .b64 param0;
        st.param.b64    [param0+0], %rd4;
        .param .b64 param1;
        st.param.b64    [param1+0], %rd1;
        .param .b32 retval0;
        call.uni (retval0),
        vprintf,
        (
        param0,
        param1
        );
        ld.param.b32    %r4, [retval0+0];
        } // callseq 0
        ret;

}

__vsubss4() is a CUDA device function intrinsic. NVIDIA does not specify that this intrinsic maps to the PTX instruction vsub4.s32.s32.s32.sat. As you can see from the generated PTX, __vsubss4() instead maps to a longish emulation sequence built from ordinary integer instructions. What you would really want to do is look at the generated SASS (machine code), e.g. from cuobjdump --dump-sass.
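For example, one way to get at the SASS (assuming the file name hello_test.cu from the posting above) is to build a cubin for a concrete architecture and disassemble it:

nvcc -arch=sm_52 -cubin hello_test.cu -o hello_test.cubin
cuobjdump --dump-sass hello_test.cubin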

If you want to try the PTX instruction itself, you can code:

unsigned int a, b, r;
unsigned int c = 0;
asm ("vsub.s32.s32.s32.sat %0,%1,%2,%3;\n\t" : "=r"(r) : "r"(a), "r"(b),  "r"(c));

When you look at the SASS generated for that, it will likely look similar to the emulation sequence code emitted for the intrinsic.
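For a self-contained experiment, the fragment above could be packaged into a small wrapper, for example as in the sketch below (the function name my_vsubss4 is made up; the fourth operand of vsub4 is the optional merge/accumulate source and is simply passed as zero):

__device__ unsigned int my_vsubss4 (unsigned int a, unsigned int b)
{
    unsigned int r;
    unsigned int c = 0;  // merge/accumulate operand of vsub4; not needed here
    asm ("vsub4.s32.s32.s32.sat %0,%1,%2,%3;\n\t" : "=r"(r) : "r"(a), "r"(b), "r"(c));
    return r;
}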

Why are there discrepancies? For Compute Capability 3.0 NVIDIA added a bunch of SIMD-in-register hardware instructions. Most of these were removed in subsequent GPU architectures. Today only a couple survive. Most (maybe all; I don’t recall) of the functionality of the hardware instructions was exposed at the PTX level. The CUDA device function intrinsics generally only exposed part of the underlying PTX-level functionality.

Once SIMD-in-register instructions were removed from the hardware, the PTX-level functionality needed to be emulated. Since most of the hardware instructions were quite versatile, with several functional modifiers, the emulation sequences tended to be slow. However, many intrinsics could be emulated at higher performance due to their reduced functionality relative to the PTX instructions.

For this reason, most of the SIMD-in-register device function intrinsics map directly to emulation code, rather than generating a corresponding PTX instruction which in turn would frequently be emulated at SASS level.

The lesson NVIDIA learned from this is that one would not want to expose hardware features at PTX level unless these hardware features are likely to be stable across several hardware generations. Otherwise, the already fairly large number of PTX instructions without direct hardware support continues to grow, increasing the maintenance burden for emulation code.

The fact that most SIMD-in-register intrinsics are emulated does not mean that they are useless at this time. Many of the emulation sequences are quite efficient. The original hardware instructions executed at 1/4 the throughput of simple integer instructions, i.e. one hardware instruction cost about as much as four full-throughput instructions, so a simple SIMD intrinsic that requires a 5-instruction emulation sequence is almost as fast. Generally, one would be hard pressed to find alternative computations with identical functionality at better performance.

By all means evaluate the utility of the SIMD-in-register intrinsics carefully in the context of your specific use case.


To give an idea about the efficiency of saturated SIMD arithmetic, here is code I wrote for vaddss4 some years back (sorry, I could not locate a vsubss4 implementation just now; I may or may not have one somewhere).

[code below updated 9/28/2024]

/*
  Copyright 2019-2024, Norbert Juffa

  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

#include <stdint.h>

#if (__CUDACC__)
#define __HOST__ __host__
#define __DEVICE__ __device__
#else // __CUDACC__
#define __HOST__
#define __DEVICE__
#endif // __CUDACC__

__HOST__ __DEVICE__ uint32_t lop3_f4 (uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t r;
#if (__CUDA_ARCH__ >= 500)
    asm ("lop3.b32 %0,%1,%2,%3,0xf4;\n\t" : "=r"(r) : "r"(a), "r"(b), "r"(c));
#else // __CUDA_ARCH__
    r = a | (b & ~c);
#endif // __CUDA_ARCH__
    return r;
} 

__HOST__ __DEVICE__ uint32_t lop3_28 (uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t r;
#if (__CUDA_ARCH__ >= 500)
    asm ("lop3.b32 %0,%1,%2,%3,0x28;\n\t" : "=r"(r) : "r"(a), "r"(b), "r"(c));
#else // __CUDA_ARCH__
    r = (a ^ b) & c;
#endif // __CUDA_ARCH__
    return r;
}

__HOST__ __DEVICE__ uint32_t lop3_d2 (uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t r;
#if (__CUDA_ARCH__ >= 500)
    asm ("lop3.b32 %0,%1,%2,%3,0xd2;\n\t" : "=r"(r) : "r"(a), "r"(b), "r"(c));
#else // __CUDA_ARCH__
    r = a ^ (~b & c);
#endif // __CUDA_ARCH__ 
    return r;
}

__HOST__ __DEVICE__ uint32_t lop3_14 (uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t r;
#if (__CUDA_ARCH__ >= 500)
    asm ("lop3.b32 %0,%1,%2,%3,0x14;\n\t" : "=r"(r) : "r"(a), "r"(b), "r"(c));
#else // __CUDA_ARCH__
    r = (a ^ b) & ~c;
#endif // __CUDA_ARCH__
    return r;
}

#define MSB_MASK (0x80808080U)  // mask for msb of each byte

__HOST__ __DEVICE__ uint32_t masked_sign_to_byte_mask (uint32_t a)
{
#if (__CUDA_ARCH__ >= 200)
    asm ("prmt.b32 %0,%0,0,0xba98;" : "+r"(a)); // convert MSBs to masks
#else
    a = a & MSB_MASK;
    a = a + a - (a >> 7); // extend MSBs to full byte to create mask
#endif
    return a;
}
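
/* A note on the prmt trick above (based on the PTX ISA description of prmt):
   when bit 3 of a selector nibble is set, prmt replicates the sign bit (MSB)
   of the selected source byte across the entire result byte, so the selector
   0xba98 turns the MSB of each byte of 'a' into a 0x00 / 0xff byte mask, e.g.
     masked_sign_to_byte_mask (0x80000080) == 0xff0000ff
     masked_sign_to_byte_mask (0x7f7f7f7f) == 0x00000000
*/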

__HOST__ __DEVICE__ uint32_t vaddss4 (uint32_t a, uint32_t b)
{
    uint32_t sum, res, ofl, sga, msk;
    res = (a & ~MSB_MASK) + (b & ~MSB_MASK);
    sum = a ^ b;
    ofl = lop3_14 (res, a, sum); // ofl = (res ^ a) & ~sum
    sga = masked_sign_to_byte_mask (a);  // sign(a)-mask
    msk = masked_sign_to_byte_mask (ofl);// overflow-mask
    res = lop3_d2 (res, ~MSB_MASK, sum); // res = res ^ (MSB_MASK & sum)
    sga = lop3_28 (sga, ~MSB_MASK, msk); // sga = (sga ^ ~MSB_MASK) & msk
    res = lop3_f4 (sga, res, msk);       // res = sga | (res & ~msk)
    return res;
}

For an sm_80 target, this compiles to a 10-instruction sequence (extracted from Compiler Explorer):

vaddss4(unsigned int, unsigned int):
 LOP3.LUT R3, R4, 0x7f7f7f7f, RZ, 0xc0, !PT 
 LOP3.LUT R6, R5.reuse, 0x7f7f7f7f, RZ, 0xc0, !PT 
 LOP3.LUT R0, R5, R4, RZ, 0x3c, !PT 
 IMAD.IADD R3, R3, 0x1, R6 
 LOP3.LUT R5, R3, R4, R0.reuse, 0x14, !PT 
 PRMT R4, R4, 0xba98, RZ 
 PRMT R5, R5, 0xba98, RZ 
 LOP3.LUT R0, R3, 0x7f7f7f7f, R0, 0xd2, !PT 
 LOP3.LUT R4, R4, 0x7f7f7f7f, R5, 0x28, !PT 
 LOP3.LUT R4, R4, R0, R5, 0xf4, !PT 
 RET.ABS.NODEC R20 0x0    
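
As an aside, anyone who wants to sanity-check the bit twiddling on the host can compare against a straightforward per-byte reference; below is a sketch (the names vaddss4_ref and sat_add_s8 are made up for illustration):

#include <stdint.h>

// reference: signed saturating addition of a single byte
static int8_t sat_add_s8 (int8_t x, int8_t y)
{
    int t = (int)x + (int)y;
    if (t >  127) t =  127;
    if (t < -128) t = -128;
    return (int8_t)t;
}

// byte-wise reference for vaddss4(), suitable for host-side checking
static uint32_t vaddss4_ref (uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int i = 0; i < 4; i++) {
        int8_t x = (int8_t)(uint8_t)(a >> (8 * i));
        int8_t y = (int8_t)(uint8_t)(b >> (8 * i));
        r |= (uint32_t)(uint8_t)sat_add_s8 (x, y) << (8 * i);
    }
    return r;
}

Comparing vaddss4() against this reference over random operand pairs (or exhaustively per byte lane) is an easy way to validate the code above.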

It seems I never got around to implementing vsubss4 five years ago, as I could find no corresponding code in my collection. I did code up vsubus4, i.e. byte-wise subtraction with unsigned saturation, which results in an efficient 7-instruction sequence. Based on this and the previous post I think it is safe to summarize that emulation-based SIMD intrinsics for saturated arithmetic are fast enough to be of practical utility. Any missed optimization opportunities in the emulation sequences should be brought to NVIDIA’s attention via an enhancement request (RFE).

/*
  Copyright 2019, Norbert Juffa

  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

#include <stdint.h>

#if (__CUDACC__)
#define __HOST__ __host__
#define __DEVICE__ __device__
#else // __CUDACC__
#define __HOST__
#define __DEVICE__
#endif // __CUDACC__

__HOST__ __DEVICE__ uint32_t lop3_b2 (uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t r;
#if (__CUDA_ARCH__ >= 500)
    asm ("lop3.b32 %0,%1,%2,%3,0xb2;\n\t" : "=r"(r) : "r"(a), "r"(b), "r"(c));
#else // __CUDA_ARCH__
    r = (a | c) & ((a & c) | ~b);
#endif // __CUDA_ARCH__
    return r;
} 

__HOST__ __DEVICE__ uint32_t lop3_82 (uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t r;
#if (__CUDA_ARCH__ >= 500)
    asm ("lop3.b32 %0,%1,%2,%3,0x82;\n\t" : "=r"(r) : "r"(a), "r"(b), "r"(c));
#else // __CUDA_ARCH__
    r = ~(a ^ b) & c;
#endif // __CUDA_ARCH__
    return r;
} 

__HOST__ __DEVICE__ uint32_t lop3_28 (uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t r;
#if (__CUDA_ARCH__ >= 500)
    asm ("lop3.b32 %0,%1,%2,%3,0x28;\n\t" : "=r"(r) : "r"(a), "r"(b), "r"(c));
#else // __CUDA_ARCH__
    r = (a ^ b) & c;
#endif // __CUDA_ARCH__
    return r;
} 

#define MSB_MASK (0x80808080U)  // mask for msb of each byte
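
// masked_sign_to_byte_mask() from the previous post is needed by vsubus4()
// below; repeated here so this listing is self-contained
__HOST__ __DEVICE__ uint32_t masked_sign_to_byte_mask (uint32_t a)
{
#if (__CUDA_ARCH__ >= 200)
    asm ("prmt.b32 %0,%0,0,0xba98;" : "+r"(a)); // convert MSBs to masks
#else
    a = a & MSB_MASK;
    a = a + a - (a >> 7); // extend MSBs to full byte to create mask
#endif
    return a;
}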

__HOST__ __DEVICE__ uint32_t vsubus4 (uint32_t x, uint32_t y)
{
    uint32_t a, b, r, s, t;
    a = x | MSB_MASK;
    b = y &~ MSB_MASK;
    r = a - b;
    s = lop3_82 (x, y, MSB_MASK);
    t = lop3_b2 (x, y, r);
    t = masked_sign_to_byte_mask (t);
    r = lop3_28 (r, s, t);
    return r;
}

Compiled for an sm_80 target, this results in a 7-instruction sequence (as extracted from Compiler Explorer):

vsubus4(unsigned int, unsigned int):
 LOP3.LUT R0, R4, 0x80808080, RZ, 0xfc, !PT 
 LOP3.LUT R3, R5, 0x7f7f7f7f, RZ, 0xc0, !PT 
 IMAD.IADD R0, R0, 0x1, -R3 
 LOP3.LUT R3, R4.reuse, R5, R0, 0xb2, !PT 
 LOP3.LUT R5, R4, 0x80808080, R5, 0x84, !PT 
 PRMT R4, R3, 0xba98, RZ 
 LOP3.LUT R4, R0, R5, R4, 0x28, !PT 
 RET.ABS.NODEC R20 0x0 
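
To double-check against the built-in intrinsic on the device, a throwaway kernel along these lines would do (the kernel name check_vsubus4 is made up; __vsubus4() is the CUDA intrinsic):

#include <stdio.h>

__global__ void check_vsubus4 (uint32_t a, uint32_t b)
{
    // print emulated and intrinsic results side by side; they should match
    printf ("emulated: %08x  intrinsic: %08x\n", vsubus4 (a, b), __vsubus4 (a, b));
}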

[Later:]

I went ahead and programmed up vsubss4. This results in a 10-instruction sequence when compiled for an sm_80 target. I note that CUDA’s built-in intrinsics __vaddss4, __vsubss4, and __vsubus4 produce instruction sequences of 10, 10, and 7 instructions, respectively. While the code generated for the intrinsics differs a bit from what is generated for the source code presented here, it seems reasonable to conclude that the CUDA intrinsics enumerated above are “optimal”, given that the instruction counts of the corresponding emulation sequences match.

I further note that I did not run into issues with the CUDA compiler not delivering the desired LOP3s in the code for vsubss4, whereas in the code for vaddss4 and vsubus4 from five years ago I apparently resorted to hard-coded LOP3 instances. One might see this as anecdotal evidence that the compiler’s ability to generate “optimal” sequences of LOP3s has improved.

/*
  Copyright (c) 2024, Norbert Juffa

  Redistribution and use in source and binary forms, with or without 
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright 
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

#include <stdint.h>

#if (__CUDACC__)
#define __HOST__ __host__
#define __DEVICE__ __device__
#else // __CUDACC__
#define __HOST__
#define __DEVICE__
#endif // __CUDACC__

__HOST__ __DEVICE__ uint32_t lop3_e4 (uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t r;
#if (__CUDA_ARCH__ >= 500)
    asm ("lop3.b32 %0,%1,%2,%3,0xe4;\n\t" : "=r"(r) : "r"(a), "r"(b), "r"(c));
#else // __CUDA_ARCH__
    r = (a & c) | (b & ~c);
#endif // __CUDA_ARCH__
    return r;
} 

__HOST__ __DEVICE__ uint32_t masked_select (uint32_t a, uint32_t b, uint32_t m)
{
#if (__CUDA_ARCH__ >= 500) 
    return lop3_e4 (a, b, m);
#elif 0
    return (((a)&(m))|((b)&(~(m))));
#else
    return((((a)^(b))&(m))^(b));
#endif
}

#define MSB_MASK (0x80808080U)  // mask for msb of each byte

__HOST__ __DEVICE__ uint32_t masked_sign_to_byte_mask (uint32_t a)
{
#if (__CUDA_ARCH__ >= 200)
    asm ("prmt.b32 %0,%0,0,0xba98;\n\t" : "+r"(a)); // convert MSBs to masks
#else
    a = a & MSB_MASK;
    a = a + a - (a >> 7); // extend MSBs to full byte to create mask
#endif
    return a;
}

__HOST__ __DEVICE__ uint32_t vsubss4 (uint32_t a, uint32_t b)
{
    uint32_t r, s, t;
    r = a | MSB_MASK;                     // LOP3
    s = b &~ MSB_MASK;                    // LOP3
    r = r - s;                            // IADD
    t = (b ^ a) & MSB_MASK;               // LOP3 
    s = t & (r ^ a);                      // LOP3 // msb indicates overflow
    r = r ^ t ^ MSB_MASK;                 // LOP3 // regular result
    t = masked_sign_to_byte_mask (s);     // PRMT
    s = ~MSB_MASK + ((a & MSB_MASK) >> 7);// LOP3, LEA // potential special res.
    r = masked_select (s, r, t);          // LOP3 //select spec. or reg. result
    return r;
}

The source above turns into the following when compiled for an sm_80 target (extracted from Compiler Explorer):

vsubss4(unsigned int, unsigned int):
 LOP3.LUT R3, R4.reuse, 0x80808080, RZ, 0xfc, !PT 
 LOP3.LUT R6, R5.reuse, 0x7f7f7f7f, RZ, 0xc0, !PT 
 LOP3.LUT R0, R5, 0x80808080, R4, 0x48, !PT 
 LOP3.LUT R5, R4, 0x80808080, RZ, 0xc0, !PT 
 IMAD.IADD R3, R3, 0x1, -R6 
 LEA.HI R5, R5, 0x7f7f7f7f, RZ, 0x19 
 LOP3.LUT R4, R0.reuse, R3, R4, 0x60, !PT 
 LOP3.LUT R0, R0, 0x80808080, R3, 0x96, !PT 
 PRMT R3, R4, 0xba98, RZ 
 LOP3.LUT R4, R5, R0, R3, 0xe4, !PT 
 RET.ABS.NODEC R20 0x0 
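
As a quick functional check, a minimal host-side test with the operands from the original question (no byte-wise overflow or underflow occurs, so saturation does not kick in) might look like this sketch:

#include <assert.h>

int main (void)
{
    // per byte: 0x09-0x01=0x08, 0x08-0x02=0x06, 0x07-0x03=0x04, 0x06-0x04=0x02
    assert (vsubss4 (0x09080706u, 0x01020304u) == 0x08060402u);
    return 0;
}

This is also the value the program in the original question should print for __vsubss4().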