dp4a instruction usage in Pascal architecture GPUs


I want to multiply two 4-byte numbers using the dp4a instruction and want to compare the speedup with and without using dp4a.
Can anyone let me know how to use dp4a in .cu CUDA code to do this?


dp4a is for multiplying one-byte numbers. It is not used for multiplying 4-byte numbers. 4-byte integer multiply would be done with ordinary CUDA C/C++ code.

Nothing prevents a programmer from writing code that is equivalent to DP4A, so just give it a try. I haven’t looked at the detailed specification for the DP4A instruction and don’t have DP4A-capable hardware, but your code would look something like this:

char4 a, b;
int c, d;
d = a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w + c;

Unless the compiler automagically compiles that into a DP4A instruction (unlikely, but check by dumping the machine code with cuobjdump --dump-sass), this would give you the baseline performance without DP4A.

As txbob points out, this instruction is not suitable for constructing wide integer multiplies. The basic building block for that is the XMAD instruction. There is no direct access to XMAD, but by writing appropriate PTX code one can get the compiler to generate it. See the following for a worked example:


@txbob - Yes, I figured it out by now. I used the function __dp4a(unsigned int srcA, unsigned int srcB, unsigned int c), where the arguments are 8-bit unsigned values (0-255). If I give values above 255, it rounds up. So that’s correct behaviour, i.e. __dp4a() expects 8-bit inputs. Currently I am setting the 3rd argument to 0, as __dp4a() returns the multiplication of the first two arguments. I think the 3rd argument is an offset (not sure).

I don’t understand how dp4a achieves parallelism and generates an int32 accumulated output if I am giving only 8-bit inputs rather than 32-bit inputs. What’s the difference between multiplying two 8-bit matrices with ordinary C++ CUDA code and with dp4a?

If you write out a matrix multiply, you will find that each element of the result matrix is a dot product. For source matrices with byte-sized elements DP4A provides speedup because it accumulates four products into the sum in one fell swoop, i.e. it replaces four ordinary XMAD instructions.

Internal to the DP4A instruction, both multiplies and adds can operate with lower latency than regular ones because the multiplies are narrow and the additions can be combined with a carry-save adder, rather than a carry-propagate adder (the speed of additions is limited by the speed of carry propagation).

The third argument to __dp4a() is the 32-bit integer added to the sum of products. When accumulating a dot product, I would expect this input and the output to be the same variable. But by not requiring this, the instruction is more flexible. It’s the same issue as with the so-called FMA4 fused multiply-adds on the GPU vs the FMA3 fused multiply-adds on Intel CPUs (the former is more flexible than the latter).

Hello njuffa,

Thanks for your explanation. That means with dp4a you would expect a 4x speedup. But I am not getting that. Maybe it depends on the GPU as well; I am currently using a 1080 Ti. (Hopefully the P40 performs much better.)
Below is my code for matrix multiplication (non-square matrices). I have replaced the ordinary C++ multiplication with dp4a as shown. I am getting the same result matrices as with ordinary C++ code, so I am hoping this implementation is correct.

__global__ void gpu_matrix_mult(int *a, int *b, int *c, int m, int n, int k)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int sum = 0;
    if (col < k && row < m)
    {
        for (int i = 0; i < n; i++)
        {
            sum += a[row * n + i] * b[i * k + col]; // Normal C++ CUDA code
            //sum += __dp4a(a[row * n + i], b[i * k + col], 0); // DP4A code
        }
        c[row * k + col] = sum;
    }
}

Below are my observations with and without dp4a:

  1. If I multiply vector-matrix, the elapsed times are almost the same.

:~/no_backup/d1230> ./matrix (WITHOUT DP4A) //Commented above dp4a line in code
please type in m n and k
5000 5000 1
Time elapsed on matrix multiplication of 5000x5000 . 5000x1 on GPU: 13.007168 ms.

:~/no_backup/d1230> vi matrix.cu
:~/no_backup/d1230> nvcc -arch=sm_61 -o matrix matrix.cu

d1230@linse3:~/no_backup/d1230> ./matrix (WITH DP4A) //UN-Commented above dp4a line in code
please type in m n and k
5000 5000 1
Time elapsed on matrix multiplication of 5000x5000 . 5000x1 on GPU: 12.939136 ms.

  2. If I multiply square matrix-matrix, I see the same behavior as above.

  3. However, if I multiply matrix-matrix (but not square matrices), I notice around a 2x speedup with DP4A.

./matrix (WITHOUT DP4A) //Commented above dp4a line in code
please type in m n and k
5000 5000 500
Time elapsed on matrix multiplication of 5000x5000 . 5000x500 on GPU: 114.162689 ms.

:~/no_backup/d1230> vi matrix.cu
:~/no_backup/d1230> nvcc -arch=sm_61 -o matrix matrix.cu

~/no_backup/d1230> ./matrix (WITH DP4A) //UN-Commented above dp4a line in code
please type in m n and k
5000 5000 500
Time elapsed on matrix multiplication of 5000x5000 . 5000x500 on GPU: 69.245346 ms.

And this ~2x behaviour only appears above a certain matrix size. I guess there exists some overhead threshold for dp4a.

Please let me know what could be the cause that dp4a can’t provide an improvement in the vector-matrix and square matrix multiplication cases…

Thanks a lot in advance!!!

I think if you were to use the CUDA profiler, it would indicate that the posted code is dominated by memory throughput, not by computational throughput. So improving the computational throughput by using DP4A is not going to help (as much as you expect).

You would want to tile your matrix operations so that they minimize memory overhead, for example by blocking in registers and/or shared memory. While you can experiment with this yourself, doing it well requires a lot of experience with designing and implementing well-tuned code. The easiest way to reap the benefits of DP4A, and to avail yourself of the expertise of NVIDIA’s engineers who have that experience, is to use the relevant libraries provided by NVIDIA.

You mean the TensorRT library, which has INT8 support for dp4a on the Pascal architecture? Which other libraries are you referring to here?

TensorRT, cuDNN, and cuBLAS all have support for int8 computation.