cublasGemmEx doesn't work with INT8 utilizing __dp4a instruction on NVIDIA 1080TI


As per the documentation at this link, cublasGemmEx() is not working for INT8 matrix multiplication.

It says:
"For CUDA_R_32I computation type the matrix types combinations supported by cublasGemmEx are listed below. This path is only supported with alpha, beta being either 1 or 0; A, B being 32-bit aligned; and lda, ldb being multiples of 4." and "the combination of the parameters Atype, Btype and Ctype and the algorithm type, algo is not supported"

I am getting the error below:

Attached is my code below; I don't know what's wrong. It's not working even though I follow the documentation.
I checked all the algos, but none of them works.
My compilation is also correct, picking cuBLAS from cuda-8.0:

nvcc -arch=sm_61 -o cublas cublas.cu -L /usr/local/cuda-8.0/lib64/ -lcublas

#undef _GLIBCXX_USE_INT128

#include <cstdint>
#include <iostream>
#include <thrust/device_vector.h>
#include <cublas_v2.h>

// C-style indexing
int ci(int row, int column, int nColumns) {
    return row*nColumns+column;
}

int main(void)
{
    int rowD = 40;   // number of rows of D
    int colD = 40;   // number of columns of D
    int rowE = colD; // number of rows of E
    int colE = 40;   // number of columns of E
    int rowF = rowD;
    int colF = colE;

    // initialize data
    thrust::device_vector<int8_t> D(rowD * colD);
    thrust::device_vector<int8_t> E(rowE * colE);
    thrust::device_vector<int32_t> F(rowF * colF);

    for (int i = 0; i < rowD; i++){
        for (int j = 0; j < colD; j++){
            D[ci(i,j,colD)] = (i+j);
            // std::cout << D[ci(i,j,colD)] << " ";
        }
        //std::cout << "\n";
    }

    for (int i = 0; i < rowE; i++){
        for (int j = 0; j < colE; j++){
            E[ci(i,j,colE)] = (i+j);
            //std::cout << E[ci(i,j,colE)] << " ";
        }
        //std::cout << "\n";
    }

    for (int i = 0; i < rowF; i++)
        for (int j = 0; j < colF; j++)
            F[ci(i,j,colF)] = 0;

    cublasHandle_t handle;

    /* Initialize CUBLAS */
    cublasStatus_t status = cublasCreate(&handle);
    if (status != CUBLAS_STATUS_SUCCESS) {
        std::cerr << "!!! CUBLAS initialization error\n";
    }

    float alpha = 1.0f; float beta = 0.0f;
#if 0
    status = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                         colE, rowD, colD,
                         &alpha, thrust::raw_pointer_cast(&E[0]), colE,
                         thrust::raw_pointer_cast(&D[0]), colD,
                         &beta, thrust::raw_pointer_cast(&F[0]), colE); // colE x rowD
#endif
    status = cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                          colE, rowD, colD,
                          &alpha, thrust::raw_pointer_cast(&E[0]), CUDA_R_8I, colE,
                          thrust::raw_pointer_cast(&D[0]), CUDA_R_8I, colD,
                          &beta, thrust::raw_pointer_cast(&F[0]), CUDA_R_32I, colE,
                          CUDA_R_32I, CUBLAS_GEMM_ALGO0); // colE x rowD
    if (status != CUBLAS_STATUS_SUCCESS) {
        std::cerr << "!!! 0 kernel execution error.\n" << status << std::endl;
    }

#if 0
    for (int i = 0; i < rowF; i++){
        for (int j = 0; j < colF; j++){
            std::cout << F[ci(i,j,colF)] << " ";
        }
        std::cout << "\n";
    }
#endif

    status = cublasDestroy(handle);
    if (status != CUBLAS_STATUS_SUCCESS) {
        std::cerr << "!!! shutdown error (A)\n";
    }

    return 0;
}

Can you clarify whether you have installed CUDA 8.0.61 and have also installed:

Patch 2 (Released Jun 26, 2017)
cuBLAS Patch Update to CUDA 8: Includes performance enhancements and bug-fixes

When I compile and run your code using CUDA 8 on a Pascal Titan X, I get the following output:

$ nvcc -arch=sm_61 -o t414 t414.cu -lcublas
$ ./t414
 ** On entry to GEMM_EX  parameter number 6 had an illegal value
!!!! 0 kernel execution error.

If you re-read the cublas documentation, you will see that for CUDA_R_32I compute type, the use of:

float alpha = 1.0f;float beta = 0.0f;

is incorrect. When I fix that issue, your code compiles and runs without runtime error for me.

The CUBLAS_STATUS_NOT_SUPPORTED error you are getting is a different error, which suggests to me that you are attempting to run this code on a non-cc6.1 device. Perhaps your device-selection call is not selecting the device you think.

float alpha = 1.0f; float beta = 0.0f; were the culprits.
I have one more question:

cublasGemmEx() with Atype = CUDA_R_8I, Btype = CUDA_R_8I and computeType CUDA_R_32I only does multiplication of signed 8-bit numbers (-128 to 127). Does it also support UNSIGNED 8-bit multiplication (0 to 255), something like CUDA_R_8U for Atype, Btype and CUDA_R_32U for computeType?

How can I do multiplication of unsigned 8-bit numbers?


Does cublasGemmEx() internally work with threads? Because I am able to compute a 1000x1000 by 1000x1000 matrix multiplication in ~100 us on the GPU.

What should I do if my lda, ldb are not multiples of 4? Pad with zeros?

Does it use the __dp4a instruction with the CUDA_R_32I compute type?

How should I calculate Tflops?

I used the formula (2*n^3 - n^2)/execution_time_on_gpu/10^12 = 20 Tflops for the above matrix.
I think that's too much for a 1080 Ti. What am I doing wrong?

Not sure what you mean. CPU threads? I don’t know, maybe, but it’s not essential to the operation. The CUDA GPU driver will spin up CPU threads for various purposes.

GPU threads? Of course it uses GPU threads.

That should work.

Yes, it should, that is the whole point.

There are no Tflops. This is not floating-point arithmetic. You could compute Tops.

20 Tops seems plausible to me. The 1080 Ti should be capable of a theoretical peak of over 40 Tops in this mode.

(1080Ti should be in the same range as P40 for this computation)

Hi txbob,

Thanks very much for your answers and clarifications!

Just another follow-up question,

cublasGemmEx() with Atype = CUDA_R_8I, Btype = CUDA_R_8I and computeType CUDA_R_32I only does multiplication of signed 8-bit numbers (-128 to 127).

Does it also support UNSIGNED 8-bit multiplication (0 to 255), something like CUDA_R_8U for Atype, Btype and CUDA_R_32I for computeType? The documentation doesn't mention anything about this.

How can I do multiplication of unsigned 8-bit numbers using cublasGemmEx()?


Large matrices can be multiplied in O(n^2.7) time.

The datatype is indeed CUDA_R_8U, which is mentioned here:

However I don’t see any indication that CUDA_R_8U is supported for any computations in CUBLAS. You could file an enhancement request (bug) at

The dp4a instruction itself does appear to know how to handle unsigned integer types, but that is not integrated into CUBLAS AFAICT:

Honestly this doesn’t surprise me, as CUBLAS (roughly speaking) emulates a BLAS library implementation, and nearly every other BLAS-like function that I can find operates on signed types.

Hi txbob,

I found an API in cublas_api.h in CUDA 8.0 which seems to support UINT8 multiplication:

cublasUint8gemmBias (cublasHandle_t handle,
cublasOperation_t transa, cublasOperation_t transb, cublasOperation_t transc,
int m, int n, int k,
const unsigned char *A, int A_bias, int lda,
const unsigned char *B, int B_bias, int ldb,
unsigned char *C, int C_bias, int ldc,
int C_mult, int C_shift);

However, I cannot understand its arguments. Why is C an unsigned int * pointer? If it's intended to hold the result of the UINT8 matrix multiplication A.B, it should be INT32.
What are the other parameters, C_mult and C_shift?

Please give your opinion.
I would like to multiply two UINT8 matrices and see the result in the output matrix C.

Correct me if I am wrong .
Thanks !!!

Why? If the inputs are of an unsigned integer type, it seems natural that the output would be of an unsigned integer type as well. Since DP4A accumulates the 16-bit products of 8-bit factors into a 32-bit sum, it seems very much appropriate that the output should be an ‘unsigned int’, which is essentially a uint32_t on all platforms supported by CUDA.

A more interesting question is why the API is “hidden”. Either it was inadvertently left out of the documentation, in which case NVIDIA needs to fix the docs, or it is not fully functional or used for internal purposes only, in which case you should not use it (anything undocumented has the tendency to change or go away without prior notice). You might consider filing a bug against the documentation to find out which of the two cases applies.


I meant that the resulting matrix C is given in this API as 'unsigned char *C', which means it can hold only 8-bit unsigned values, not uint32_t values. This is my concern. How will it accumulate? It will rather overflow.


No idea. Without documentation you would have to reverse engineer it. Maybe there was a potential use case and it was abandoned during development, but someone failed to remove the API prior to shipping. Maybe it is an experimental interface for internal or temporary use. However if it were the latter I would expect an accompanying comment warning off regular users (at least that is what I did when I had to expose a few functions in a CUDA library years ago that were only to be used internally until new infrastructure was put into place).