Performance Difference DP4A vs VMIN4 using GTX1060

I wonder whether the following code for summing the four int8_t components might be faster. It depends on the throughput of the PRMT instruction; I don’t know what that is for sm_61.

uint32_t packed_int8;
int32_t sum;
asm ("{\n\t"
     ".reg .s32 s,t,u,v;\n\t"
     "prmt.b32  s,%1,0,0x8880;\n\t"  // extract and sign extend byte 0
     "prmt.b32  t,%1,0,0x9991;\n\t"  // extract and sign extend byte 1
     "prmt.b32  u,%1,0,0xaaa2;\n\t"  // extract and sign extend byte 2
     "prmt.b32  v,%1,0,0xbbb3;\n\t"  // extract and sign extend byte 3
     "add.s32   %0, s,t;\n\t"
     "add.s32   %0,%0,u;\n\t"
     "add.s32   %0,%0,v;\n\t"
     "}"
     : "=r"(sum)
     : "r"(packed_int8));

For sm_61, since we know that dp4a is available, I wonder if summing the four bytes could be done quickly with a single dp4a instruction, using the value to be summed and 0x01010101 as the operands.

unsigned &T = reinterpret_cast<unsigned &>(t);
int &Ts = reinterpret_cast<int &>(t);
T = __vmins4(A, B);
  
d[0] = __dp4a(Ts, 0x01010101, d[0]);
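Put together, a minimal kernel sketch of that combination (kernel name, launch setup, and the atomic accumulation are illustrative, not from the posts above):

__global__ void min_bytesum (const unsigned int *A, const unsigned int *B,
                             int *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        unsigned int t = __vmins4 (A[i], B[i]);    // per-byte signed minimum
        int ts = (int)t;                           // reinterpret for sign extension
        atomicAdd (d, __dp4a (ts, 0x01010101, 0)); // sum the four int8 lanes
    }
}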

Thanks for the pointer. Yeah, that should work just fine. By the way, is there a reason __dp4a is not mentioned anywhere in the CUDA Programming Guide? I wanted to refresh my memory on how exactly __dp4a is specified, and could not find it when searching the latest PDF.

I infer from the PTX documentation that passing int32_t (or int) indicates we are passing a char4 whose elements should be sign-extended, while passing uint32_t (or unsigned int) indicates we are passing a uchar4 whose elements should be zero-extended. Is that correct?
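If that inference is correct, a quick worked example (mine) shows the difference: with all input bytes equal to 0xff, the signed overload treats each byte as -1 while the unsigned overload treats it as 255:

int s          = __dp4a ((int)0xffffffff, 0x01010101, 0);  // 4 * (-1) * 1 = -4
unsigned int u = __dp4a (0xffffffffu, 0x01010101u, 0u);    // 4 * 255 * 1 = 1020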

[Later:]

Looking at the relevant header file (sm_61_intrinsics.h), I see that there is

int __dp4a (int, int, int)
int __dp4a (char4, char4, int)

for the signed operation, and

unsigned int __dp4a (unsigned int, unsigned int, unsigned int)
unsigned int __dp4a (uchar4, uchar4, unsigned int)

for the unsigned operation.

While the code generated for __dp4a() with int arguments looks as expected, the code generated for __dp4a() with char4 arguments looks needlessly complex. I would have expected the generated code to look the same. I wonder what is going on there?

My little test app, compiled on CUDA 11.1:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define USE_INT (0)

#if USE_INT
__global__ void kernel (int32_t a, int32_t b, int32_t *res)
#else // USE_INT
__global__ void kernel (char4 a, char4 b, int32_t *res)
#endif // USE_INT
{
    *res = __dp4a (a, b, 0);
}

int main (void)
{
#if USE_INT
    int32_t a = 0xfcfdfeff;
    int32_t b = 0x01010101;
#else // USE_INT
    char4 a = {-1, -2, -3, -4};
    char4 b = {1, 1, 1, 1};
#endif // USE_INT
    int32_t res, *res_d = 0;
    cudaMalloc ((void**)&res_d, sizeof res_d[0]);
    kernel<<<1,1>>>(a, b, res_d);
    cudaMemcpy (&res, res_d, sizeof res, cudaMemcpyDeviceToHost);
    printf ("sum = %d\n", res);
    return EXIT_SUCCESS;
}
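(For reference, both configurations compute (-1)*1 + (-2)*1 + (-3)*1 + (-4)*1 + 0 = -10, so the program should print sum = -10 either way.)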

With USE_INT = 1, the code generated for an sm_61 target looks as expected:

        code for sm_61
                Function : _Z6kerneliiPi
        .headerflags    @"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"
                                                                             
        /*0008*/                   MOV R1, c[0x0][0x20] ; 
        /*0010*/                   MOV R0, c[0x0][0x140] ; 
        /*0018*/                   MOV R2, c[0x0][0x148] ; 
                                                                            
        /*0028*/                   MOV R3, c[0x0][0x14c] ;                   
        /*0030*/                   IDP.4A.S8.S8 R0, R0, c[0x0][0x144], RZ ; 
        /*0038*/                   STG.E [R2], R0 ;                        
                                                                
        /*0048*/                   NOP ;            
        /*0050*/                   NOP ;            
        /*0058*/                   NOP ;                
                                                              
        /*0068*/                   EXIT ;    
        /*0070*/                   BRA 0x70 ;  
        /*0078*/                   NOP;   
                ..........

I would have expected the code generated for USE_INT = 0 to look identical, but instead got this monstrosity:

        code for sm_61
                Function : _Z6kernel5char4S_Pi
        .headerflags    @"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"
                                                                 
        /*0008*/                   MOV R1, c[0x0][0x20] ;        
        /*0010*/                   LDC.U8 R3, c[0x0][0x142] ;    
        /*0018*/                   LDC.U8 R6, c[0x0][0x146] ;    
                                                                 
        /*0028*/                   LDC.U8 R0, c[0x0][0x143] ;    
        /*0030*/                   LDC.U8 R7, c[0x0][0x141] ;    
        /*0038*/                   LDC.U8 R4, c[0x0][0x147] ;    
                                                                 
        /*0048*/                   LDC.U8 R8, c[0x0][0x145] ;    
        /*0050*/                   MOV R2, c[0x0][0x140] ;       
        /*0058*/                   MOV R5, c[0x0][0x144] ;       
                                                                 
        /*0068*/                   LOP32I.AND R2, R2, 0xff ;     
        /*0070*/                   LOP32I.AND R5, R5, 0xff ;     
        /*0078*/                   PRMT R2, R3, 0x7054, R2 ;     
                                                                 
        /*0088*/                   PRMT R5, R6, 0x7054, R5 ;     
        /*0090*/                   SHL R3, R8, 0x8 ;             
        /*0098*/                   PRMT R2, R0, 0x654, R2 ;      
                                                                 
        /*00a8*/                   SHL R0, R7, 0x8 ;             
        /*00b0*/                   PRMT R4, R4, 0x654, R5 ;      
        /*00b8*/                   LOP.OR R0, R2, R0 ;           
                                                                 
        /*00c8*/                   LOP.OR R5, R4, R3 ;           
        /*00d0*/                   MOV R2, c[0x0][0x148] ;       
        /*00d8*/                   MOV R3, c[0x0][0x14c] ;       
                                                                 
        /*00e8*/                   IDP.4A.S8.S8 R0, R0, R5, RZ ; 
        /*00f0*/                   STG.E [R2], R0 ;              
        /*00f8*/                   NOP ;                         
                                                                 
        /*0108*/                   EXIT ;                        
        /*0110*/                   BRA 0x110 ;                   
        /*0118*/                   NOP;                          
                                                                 
        /*0128*/                   NOP;                          
        /*0130*/                   NOP;                          
        /*0138*/                   NOP;                          
		..........

I used this blog (specifically the section: “Integer Dot Product Intrinsics”) for reference. I agree that is not documentation. Neither is a header file. Neither are the PTX docs. I’ve filed a bug (3179941) just now to see if we can get the __dp4a intrinsics documented.

I’ve also filed a bug (3179943) for the observation you pointed out comparing int vs. char4 usage.

Thanks.

The code generated for the char4 variant of __dp4a() looks as if the compiler assumes both a lack of alignment and a lack of knowledge about the physical ordering of the struct fields. In fact, both the 4-byte alignment of char4 and the physical field ordering (.x, .y, .z, .w stored consecutively in increasing address order) should be well-known inside the compiler. That means the char4 arguments could simply be re-interpreted as int, and the compiler would be all set to emit a couple of 32-bit loads followed by IDP.4A.S8.S8.
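In the meantime, a wrapper along these lines (my sketch, not from the thread) forces the int path by reinterpreting the char4 arguments:

__device__ __forceinline__ int dp4a_from_char4 (char4 a, char4 b, int c)
{
    /* char4 is 4-byte aligned with .x at the lowest address, so it can
       safely be reinterpreted as int, allowing the compiler to emit two
       32-bit loads followed by a single IDP.4A.S8.S8. */
    return __dp4a (*reinterpret_cast<int *>(&a), *reinterpret_cast<int *>(&b), c);
}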

Interesting to see: using the __vmins4() intrinsic versus the PTX asm vmin4.s32.s32.s32.add barely makes a difference when the matrix size approaches 4096 on an RTX 2080 Ti; the performance difference grows for both larger and smaller matrices.

In the graph, the intrinsic version is labeled ‘Matrix Minimum (optimized)’ and the PTX asm version ‘Matrix Minimum’.

Have you looked at the different variants with the profiler? I would suspect your observations boil down to varying efficiency in accessing memory, which dilutes the effects of efficiency in using the computational cores. In other words, slight shifts in the balance between the task being compute-bound versus memory-bound.

But the task is never memory-bound? DP4A is always orders of magnitude faster, using the same access pattern.
DP4A in the graph is ‘Matrix Multiplication’ (the orange/red line)

I probably expressed myself poorly by using “bound”. Let me try again: “A shift in the percentage of time that goes towards memory access versus the percentage of time that goes towards computation”. Instead of speculating what the root cause could be, it seems highly advisable to first run a few variants in the profiler, then discuss any salient differences that are observed (if necessary).
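For example, assuming the executable is called app, something like this with Nsight Compute would break down memory versus compute utilization per kernel (the flag choice is just one reasonable starting point):

ncu --set full ./app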

Thanks for your help! Analyzing the root cause is currently out of scope for me, but the answer must lie in the specific instructions generated for the PTX asm vmin4.s32.s32.s32.add that are not generated for __vmins4(), and their effect on the scheduler (the input data and memory access patterns are identical), perhaps combined with the fact that the contents of the input matrices are predictable (the numbers are a consistently repeating sequence of length 4)…

Or maybe it is the ability to cache these lines (see my post earlier for the full code):

    /*0170*/                   MOV R2, c[0x0][0x140] ;                /* 0x4c98078005070002 */
    /*0178*/                   MOV R3, c[0x0][0x144] ;                /* 0x4c98078005170003 */
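(These MOVs read the kernel parameters from constant bank 0; c[0x0][0x140] is where the parameter space starts on this architecture, and such reads are served by the constant cache.)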

Yes, that works and improved performance by another factor of 2 on an RTX 2080 Ti. I don’t have the results for the GTX 1060, but I think it is safe to assume it holds for the GTX 1060 as well.

Hi Robert, are you able to give an update on this bug please?

For 3179943, the issue and the source of the discrepancy are understood. The fix is understood to be non-trivial. There has been very little movement on a fix at this point.

For 3179941, there has been essentially no movement on updating the docs.

I’m not able to make forward-looking statements. Until there is a fix, the workaround from this thread is to prefer the int usage over the char4 usage.

Thank you.

The same discrepancy occurs between uint and uchar4.