Performance Difference DP4A vs VMIN4 using GTX1060

Hi everyone,

I benchmarked the DP4A operation and VMIN4 operation on int8_t datatype (int32_t accumulate) using the CUTLASS library, so trying to fully take advantage of tiled structures. The DP4A and VMIN4 instructions both operate on 2 vectors each containing 4 8-bit signed integers.

Code (VMIN4): https://github.com/TheNewSound/cutlass/blob/matmin/examples/00_basic_gemm/basic_gemm.cu
Code (DP4A): https://github.com/TheNewSound/cutlass/blob/original/examples/00_basic_gemm/basic_gemm.cu

I would assume VMIN4 to be faster than DP4A since min() is an easier operation to perform on 2 integers than a multiplication?

But these are my results:


Ran on a GTX1060 3GB

What could be causing this difference? Why is DP4A (matrix multiplication) faster than VMIN4?

In no particular order:

  1. It’s not clear to me that either the vmin4 or dp4a would be used in either of the codes you have presented. Have you checked?
  2. dp4a compiles to a single SASS instruction on cc6.1. I’m not sure vmin4 does.
  3. “min() is an easier operation to perform on 2 integers than a multiplication” – I’m not sure how you reached that conclusion.
  4. this isn’t really using the cutlass library. The ReferenceGemm_kernel is provided as a reference, not as a principal library routine.
  1. I have checked that VMIN4 and DP4A are used, in the code they can be found in: https://github.com/TheNewSound/cutlass/blob/matmin/include/cutlass/arch/mma_sm61.h and https://github.com/TheNewSound/cutlass/blob/original/include/cutlass/arch/mma_sm61.h
  2. The number of SASS instructions could indeed be the cause, thanks for pointing this out, I will check that.
  3. If the unsigned 8-bit integer is processed as one’s or two’s complement, Its as easy as checking those two 8 bits integers for the first occurrence of a 1 (in the least significant 7 bits), then return that integer (if the most significant bits are equal). If the most significant bit is not equal, the result is clear immediately. unsigned 8 bit integer multiplication is less straightforward?
  4. I do not benchmark the ReferenceGemm_kernel, that code is ran to determine if the output of the CutlassSgemmNN kernel is correct (it should equal the output of the ReferenceGemm_kernel). I benchmark the CutlassSgemmNN code.
  1. I’ve checked just now. vmin4 seems to compile to a SASS sequence (not terribly long, but its not a single instruction). dp4a compiles to a single instruction (on cc6.1). There would certainly be a performance difference favoring dp4a due to this, but I’m not sure it accounts for what you are showing which is well over an order of magnitude difference.
  2. I would suggest you dispense with this line of reasoning and instead ask what operations (i.e. instructions) actually get performed at the device code level, and what is the throughput of these instructions. You might think that byte-wise multiplication (dp4a) is “hard”, however on cc6.1 there is dedicated hardware and instruction support for it. On an ops/s basis, on cc6.1, it actually has substantially higher throughput than any comparable math (e.g. 32-bit integer multiply, 16-bit floating point multiply, 32-bit floating point multiply, 64-bit floating point multiply)

Would you know of a method on how to correctly measure throughput of these two instructions? I think with the modified CUTLASS code I came relatively close already? They operate on the same input matrices.

Actually I should probably amend my previous statement. When I first looked at vmin4, I was using the intrinsic. The compiled sequence didn’t seem that long to me (looking again now, it seems to be ~10 instructions). However when I switched to the inline assembly version indicated in your link, the sequence is a bit longer. I don’t know for sure, but I think it’s possible that it could give rise to an order of magnitude difference between dp4a and vmin4 performance. Again, the dp4a compiles to basically a single instruction. It looks like the inline assembly vmin4 is compiling to ~20 instructions.

In any event, without getting to a calibrated measurement, I’m pretty confident in the idea that I would expect dp4a to be faster than vmin4, based on studying the SASS.

I don’t know what modifications you made to the cutlass code, nor do I know what test case you ran, so I can’t comment further.

(my initial read was not very careful - even at 10 instructions to 1, its reasonable to conclude that there might be an order of magnitude difference)

Let me state it again: dp4a is the highest math operation throughput in ops/s on a cc6.1 architecture. In a nutshell, it runs at 4x the FP32 rate (again, when viewed as ops/s). I don’t think you’re going to find any other math operation of any type that will be faster.

Thanks! I analyzed SASS output for different SM architectures (lets have a look at SM_30 and SM_61). ONLY in SM_30 the code compiles to a single VMNMX instruction (very efficient I suppose), but after SM_30 (like SM_50 and SM_61) it compiles to a big sequence using IMNMX instructions. Is this because in hardware with CC >= 5.0 the support for the VMNMX device instruction is dropped? If that would be the case, it would be impossible to run SM_30 VMIN4 code on SM_50 and later, right?

CUDA (it is very simple code):


SM_30 output (! pay attention to the single VMNMX instruction):

SM_61 output (! much longer sequence of instructions for VMIN4):

Is this because the compiler doesnt optimize VMIN4 correctly for SM_61?

Yes, and yes.

For the first question, to be a bit more general I would say that various video SIMD PTX instructions may compile to a single SASS instruction on some architectures whereas they may compile to instruction sequences on others. This is documented here (specifically the line “All other SIMD video instructions” for the case discussed here. The numerical value of 32 for the Kepler compute capability indicates that there is a single SASS instruction for it as well as a functional unit in the SM designated to handle that instruction)

For the second question, compute_30 (i.e. PTX) code could be run on a future architecture but sm_30 (i.e. SASS) code cannot. compute_30 PTX may contain a single SIMD video opcode, but when JIT-compiled to SASS for another architecture, it will translate into a SASS sequence, not a single instruction.

No, the difference between the compiler emitting a single instruction in the sm_30 case, and an instruction sequence in the sm_61 case is for the reason already covered above. There is a difference in the hardware support between these two architectures.

Very interesting! Thanks a lot for the help!

If you are interested in the vmin4 case, you might want to experiment with the intrinsic, rather than using inline assembly. According to my testing, the intrinsics generate less SASS code. I’m not sure why that would be the case. It’s also possible that I may have made some mistake in my observations.

The __vmins4 intrinsic generates less SASS code because it does not perform the addition I guess? To get the same operation done as vmin4.s32.s32.s32.add PTX one needs to perform __vmins4 followed by something that sums 4 8-bit integers to a single 32-bit integer. I think i will test performance using those 2 setups (current vmin4.s32.s32.s32.add PTX setup versus those SIMD intrinsics).

Yes, I had forgotten about the summing operations.

Back with news! I updated my VMIN4 code with vmins4 intrinsic and ran the benchmark again, and I got a 2 times performance improvement for every matrix size on GTX1060!! nice!!!

See the lines of code here:

This does however mean that vmin4.s32.s32.s32.add PTX code is not compiled to device code efficiently. See here the difference with using intrinsics:

	Function : _Z5vmin4RiRKjS1_RKi
.headerflags    @"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"
                                                                      /* 0x001f8800fc2007f6 */
    /*0008*/                   MOV R1, c[0x0][0x20] ;                 /* 0x4c98078000870001 */
    /*0010*/                   MOV R2, c[0x0][0x148] ;                /* 0x4c98078005270002 */
    /*0018*/                   MOV R3, c[0x0][0x14c] ;                /* 0x4c98078005370003 */
                                                                      /* 0x001f8800fe2007f0 */
    /*0028*/         {         MOV R4, c[0x0][0x150] ;                /* 0x4c98078005470004 */
    /*0030*/                   LDG.E R2, [R2]         }
                                                                      /* 0xeed4200000070202 */
    /*0038*/                   MOV R5, c[0x0][0x154] ;                /* 0x4c98078005570005 */
                                                                      /* 0x001f8800fc2007b1 */
    /*0048*/                   LDG.E R4, [R4] ;                       /* 0xeed4200000070404 */
    /*0050*/                   MOV R6, c[0x0][0x158] ;                /* 0x4c98078005670006 */
    /*0058*/                   MOV R7, c[0x0][0x15c] ;                /* 0x4c98078005770007 */
                                                                      /* 0x281fc400fe6007b1 */
    /*0068*/                   LDG.E R6, [R6] ;                       /* 0xeed4200000070606 */
    /*0070*/                   DEPBAR.LE SB5, 0x1 ;                   /* 0xf0f0000034170000 */
    /*0078*/                   PRMT R0, R2.reuse, 0x3210, R4.reuse ;  /* 0x36c0020321070200 */
                                                                      /* 0x001f8400fe2007f5 */
    /*0088*/                   PRMT R8, R2, 0x7654, R4 ;              /* 0x36c0020765470208 */
    /*0090*/                   PRMT R9, R0, 0x7650, RZ ;              /* 0x36c07f8765070009 */
    /*0098*/                   PRMT R11, R8, 0x7650, RZ ;             /* 0x36c07f876507080b */
                                                                      /* 0x081fc400fe2207f4 */
    /*00a8*/                   PRMT R3, R8.reuse, 0x7651, RZ ;        /* 0x36c07f8765170803 */
    /*00b0*/                   BFE R10, R9, 0x800 ;                   /* 0x380100008007090a */
    /*00b8*/                   PRMT R9, R0.reuse, 0x7651, RZ ;        /* 0x36c07f8765170009 */
                                                                      /* 0x001f8400fe2007f1 */
    /*00c8*/                   PRMT R4, R8, 0x7652, RZ ;              /* 0x36c07f8765270804 */
    /*00d0*/                   PRMT R2, R0, 0x7652, RZ ;              /* 0x36c07f8765270002 */
    /*00d8*/                   BFE R11, R11, 0x800 ;                  /* 0x3801000080070b0b */
                                                                      /* 0x001fc400fe2007f1 */
    /*00e8*/                   PRMT R8, R8, 0x7653, RZ ;              /* 0x36c07f8765370808 */
    /*00f0*/                   BFE R3, R3, 0x800 ;                    /* 0x3801000080070303 */
    /*00f8*/                   PRMT R0, R0, 0x7653, RZ ;              /* 0x36c07f8765370000 */
                                                                      /* 0x001f8400fe2007f2 */
    /*0108*/                   BFE R9, R9, 0x800 ;                    /* 0x3801000080070909 */
    /*0110*/                   IMNMX R11, R10, R11, PT ;              /* 0x5c21038000b70a0b */
    /*0118*/                   BFE R4, R4, 0x800 ;                    /* 0x3801000080070404 */
                                                                      /* 0x001fc400fe2007f1 */
    /*0128*/                   BFE R5, R8, 0x800 ;                    /* 0x3801000080070805 */
    /*0130*/                   BFE R2, R2, 0x800 ;                    /* 0x3801000080070202 */
    /*0138*/                   BFE R0, R0, 0x800 ;                    /* 0x3801000080070000 */
                                                                      /* 0x001f8400fe2007f4 */
    /*0148*/                   IMNMX R9, R9, R3, PT ;                 /* 0x5c21038000370909 */
    /*0150*/                   IMNMX R4, R2, R4, PT ;                 /* 0x5c21038000470204 */
    /*0158*/                   IMNMX R5, R0, R5, PT ;                 /* 0x5c21038000570005 */
                                                                      /* 0x001fd000fc2107e1 */
    /*0168*/                   IADD3 R6, R9, R6, R11 ;                /* 0x5cc0058000670906 */
    /*0170*/                   MOV R2, c[0x0][0x140] ;                /* 0x4c98078005070002 */
    /*0178*/                   MOV R3, c[0x0][0x144] ;                /* 0x4c98078005170003 */
                                                                      /* 0x001ffc00fe2007e2 */
    /*0188*/                   IADD3 R4, R5, R6, R4 ;                 /* 0x5cc0020000670504 */
    /*0190*/                   STG.E [R2], R4 ;                       /* 0xeedc200000070204 */
    /*0198*/                   EXIT ;                                 /* 0xe30000000007000f */
                                                                      /* 0x001f8000fc0007ff */
    /*01a8*/                   BRA 0x1a0 ;                            /* 0xe2400fffff07000f */
    /*01b0*/                   NOP;                                   /* 0x50b0000000070f00 */
    /*01b8*/                   NOP;                                   /* 0x50b0000000070f00 */
	..............................


	Function : _Z15vmin4_intrinsicRiRKjS1_RKi
.headerflags    @"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"
                                                                /* 0x001f8800fc2007f6 */
    /*0008*/                   MOV R1, c[0x0][0x20] ;           /* 0x4c98078000870001 */
    /*0010*/                   MOV R4, c[0x0][0x148] ;          /* 0x4c98078005270004 */
    /*0018*/                   MOV R5, c[0x0][0x14c] ;          /* 0x4c98078005370005 */
                                                                /* 0x001f8800f62007f0 */
    /*0028*/         {         MOV R6, c[0x0][0x150] ;          /* 0x4c98078005470006 */
    /*0030*/                   LDG.E R4, [R4]         }
                                                                /* 0xeed4200000070404 */
    /*0038*/                   MOV R7, c[0x0][0x154] ;          /* 0x4c98078005570007 */
                                                                /* 0x001f8800fc2007b1 */
    /*0048*/                   LDG.E R6, [R6] ;                 /* 0xeed4200000070606 */
    /*0050*/                   MOV R2, c[0x0][0x140] ;          /* 0x4c98078005070002 */
    /*0058*/                   MOV R3, c[0x0][0x144] ;          /* 0x4c98078005170003 */
                                                                /* 0x001f8400fe6007b1 */
    /*0068*/                   LDG.E R10, [R2] ;                /* 0xeed420000007020a */
    /*0070*/                   DEPBAR.LE SB5, 0x1 ;             /* 0xf0f0000034170000 */
    /*0078*/                   LOP32I.AND R9, R4, 0x7f7f7f7f ;  /* 0x0407f7f7f7f70409 */
                                                                /* 0x001f8400fcc007e6 */
    /*0088*/                   LOP32I.OR R0, R6, 0x80808080 ;   /* 0x0428080808070600 */
    /*0090*/                   IADD R9, R0, -R9 ;               /* 0x5c11000000970009 */
    /*0098*/                   LOP3.LUT R0, R9, R6, R4, 0x96 ;  /* 0x5be7020960670900 */
                                                                /* 0x001f9800fec007e6 */
    /*00a8*/                   LOP3.LUT R9, R6, R4, R9, 0x18 ;  /* 0x5be7048180470609 */
    /*00b0*/                   LOP.XOR R0, R9, R0 ;             /* 0x5c47040000070900 */
    /*00b8*/                   PRMT R9, R0, 0xba98, RZ ;        /* 0x36c07f8ba9870009 */
                                                                /* 0x081fc440fe2007f6 */
    /*00c8*/                   LOP3.LUT R9, R6, R9, R4, 0xb8 ;  /* 0x5be7020b80970609 */
    /*00d0*/                   BFE R4, R9.reuse, 0x808 ;        /* 0x3801000080870904 */
    /*00d8*/                   SHR R5, R9.reuse, 0x18 ;         /* 0x3829000001870905 */
                                                                /* 0x041f9800fe8207f1 */
    /*00e8*/                   BFE R0, R9.reuse, 0x800 ;        /* 0x3801000080070900 */
    /*00f0*/                   BFE R9, R9, 0x810 ;              /* 0x3801000081070909 */
    /*00f8*/                   IADD3 R10, R4, R5, R10 ;         /* 0x5cc005000057040a */
                                                                /* 0x001ffc00fe2007e2 */
    /*0108*/                   IADD3 R0, R9, R10, R0 ;          /* 0x5cc0000000a70900 */
    /*0110*/                   STG.E [R2], R0 ;                 /* 0xeedc200000070200 */
    /*0118*/                   EXIT ;                           /* 0xe30000000007000f */
                                                                /* 0x001f8000fc0007ff */
    /*0128*/                   BRA 0x120 ;                      /* 0xe2400fffff07000f */
    /*0130*/                   NOP;                             /* 0x50b0000000070f00 */
    /*0138*/                   NOP;                             /* 0x50b0000000070f00 */

From memory (I created the intrinsics, ca. 2013): The PTX instructions typically support more functionality than the intrinsics. The intrinsics should always be preferred over inline PTX, unless that extra functionality is needed.

True, and I was also comparing apples and oranges.

It’s interesting that your __vmins4 intrinsic SASS assembly does not contain IMNMX, yet is faster than PTX assembly which uses IMNMX.

By the way, why does SM_30 SASS code contain a VMNMX instruction, but that is not specified in the documentation: https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#kepler ?

How can I know for sure that SM_50 devices and higher do not support the VMNMX SASS instruction?

On sm_30, VMNMX is the instruction underlying the PTX instructions vmin and vmax. The hardware has just one instruction which distinguishes the two variants via a predicate (TRUE/FALSE), I think. The same approach is used for IMNMX.

Only the folks at NVIDIA know for sure what their hardware supports. As I recall, after sm_30 pretty much all of these byte-wise SIMD instructions were ripped out except for sum-of-absolute-differences. IIRC, on sm_30, the SIMD instructions had 1/4 the throughput of regular integer instructions, so the simple ones among them could be emulated almost as fast as using the native implementation. Since almost no software used them, they became an unnecessary burden on the hardware. GPU architecture takes a quite RISC-y approach to building processors.

The implementation of most of the SIMD intrinsics is very bit-twiddly. For __vmins4() it would be something similar to the code below. Before the SIMD intrinsics became part of CUDA, they were posted by NVIDIA as a header file implementation under a BSD license; maybe you can still find that file on the internet. Obviously the implementations could have changed since then, but it would give you an idea of how this stuff works under the hood.

    #define UINT32_H4  0x80808080U   // byte-wise sign bits (MSBs)
    #define UINT32_L4  0x01010101U   // byte-wise LSBs 

    /* extract sign bits and convert them into a boolean, byte-wise */
    static __host__ __device__ uint32_t sign_to_bool4 (uint32_t a)
    {
        return (a & UINT32_H4) >> 7;
    }

    /* extend sign bits into mask, byte-wise */
    static __host__ __device__ uint32_t sign_to_mask4 (uint32_t a)
    {
    #if (__CUDA_ARCH__ >= 200)
        asm ("prmt.b32 %0,%0,0,0xba98;" : "+r"(a)); // convert MSBs to masks
    #else
        a = a & UINT32_H4;    // isolate sign bits
        a = a + a - (a >> 7); // extend them to full byte to create mask
    #endif
        return a;
    }

    /* extend boolean into mask, byte-wise */
    static __host__ __device__ uint32_t bool_to_mask4 (uint32_t a)
    {
        return (a << 8) - a;
    }

    static __host__ __device__ uint32_t vhaddu4 (uint32_t a, uint32_t b)
    {
        /* Peter L. Montgomery's observation (newsgroup comp.arch, 2000/02/11,
           https://groups.google.com/d/msg/comp.arch/gXFuGZtZKag/_5yrz2zDbe4J):
           (A+B)/2 = (A AND B) + (A XOR B)/2.
        */
        return (a & b) + (((a ^ b) >> 1) & ~UINT32_H4);
    }

    static __host__ __device__ uint32_t vsetles4 (uint32_t a, uint32_t b)
    {
    #if (__CUDA_ARCH__ >= 300) && (__CUDA_ARCH__ < 500)
        uint32_t t = 0;
        asm ("vset4.s32.s32.le %0,%1,%2,%0;" : "+r"(t) : "r"(a), "r"(b));
        return t;
    #else
        return sign_to_bool4 (vhaddu4 (a, ~b) ^ (a ^ ~b));
    #endif
    }

    static __host__ __device__ uint32_t vcmples4 (uint32_t a, uint32_t b)
    {
    #if (__CUDA_ARCH__ >= 300) && (__CUDA_ARCH__ < 500)
        return bool_to_mask4 (vsetles4 (a, b));
    #else   
        return sign_to_mask4 (vhaddu4 (a, ~b) ^ (a ^ ~b));
    #endif
    }

    static __host__ __device__ uint32_t vmins4 (uint32_t a, uint32_t b)
    {
    #if (__CUDA_ARCH__ >= 300) && (__CUDA_ARCH__ < 500)
        uint32_t t = 0;
        asm ("vmin4.s32.s32.s32 %0,%1,%2,%0;" : "+r"(t) : "r"(a), "r"(b));
        return t;
    #else
        uint32_t t = vcmples4 (a, b);
        return (a & t) | (b & ~t);
    #endif
    }

[Later:]

I found a copy of the original header file here:

Thanks a lot for this information @njuffa , it makes sense now.