Batch Matrix Multiplication using cuBLAS

Hi NVIDIA team,

I am working on registering a plugin for an operator (Einsum) that is not currently supported in TensorRT. Instead of implementing a CUDA kernel myself, I want to use the cuBLAS library for batch matrix multiplication.

The equations I want to implement (from the Einsum operator) are:
"ntg, ncg -> nct" and "nct, ncp -> ntp" (both batch matrix multiplications)

Info about Einsum op: https://github.com/onnx/onnx/blob/master/docs/Operators.md#Einsum

I need guidance on using the cuBLAS library for batched matrix multiplication for the above two equations.

I am referring to the available references (cuBLAS :: CUDA Toolkit Documentation, https://developer.nvidia.com/blog/cublas-strided-batched-matrix-multiply/), but I don't understand how to apply them to the above equations.

Can you please assist me with this?

Thanks in advance,
Darshan C G

Looking forward to your reply.

Thanks

Hi @darshancganji12, I am not from the NVIDIA team, but it seems we are interested in solving the same problem and have complementary backgrounds.

Check this issue, where I show an implementation of batch matrix multiplication using gemmStridedBatched:
Batched matrix multiplication copying the input data (CUDA) · Issue #52111 · pytorch/pytorch (github.com)

Your ntg,ncg->nct is X2 * X1', and your nct,ncp->ntp is X1' * X2 (see the checks below).

Thus what you need to do is:

For ntg,ncg->nct, use A=X2 and B=X1 in gemmStridedBatched, and pass transA=false, transB=true.

For nct,ncp->ntp, use A=X1 and B=X2 in gemmStridedBatched, and pass transA=true, transB=false.

You can verify these claims with:

import torch

X1 = torch.randn(10, 10, 10)
X2 = torch.randn(10, 10, 10)
R1 = torch.einsum('ntg, ncg -> nct', X1, X2)
R2 = torch.einsum('nct, ncp -> ntp', X1, X2)
# Both einsums reduce to batched matrix products with one operand transposed.
assert torch.allclose(R1, torch.bmm(X2, X1.transpose(1, 2)))
assert torch.allclose(R2, torch.bmm(X1.transpose(1, 2), X2))

Hi @alexandre.felipe,

Thank you very much for your kind response. This is a great help from your side.

I will implement it using gemmStridedBatched, following your reference.

Thanks,
Darshan

Hi @alexandre.felipe,

Will the same approach not work with cublasSgemmBatched?

Thanks

I have implemented the same using cublasSgemmBatched, but I want to remove the loops used for the transposes.
if (equation == "nct,ncp->ntp")
{
    // Transpose matrix A (buffers indexed here as 3-D arrays for clarity)
    for (int k = 0; k < N; ++k)
        for (int i = 0; i < L; ++i)
            for (int j = 0; j < K; ++j) {
                transpose[k][j][i] = inputs[0][k][i][j];
            }

    cublasSgemmBatched(mCublas, CUBLAS_OP_N, CUBLAS_OP_N,
                       M, L, K, onef,
                       reinterpret_cast<const float*>(inputs[1]), M,
                       reinterpret_cast<const float*>(transpose), K,
                       zerof, reinterpret_cast<const float*>(outputs[0]), M, N);
}
else if (equation == "ntg,ncg->nct")
{
    // Transpose matrix B
    for (int a = 0; a < N; ++a)
        for (int b = 0; b < O; ++b)
            for (int c = 0; c < M; ++c) {
                transpose[a][c][b] = inputs[1][a][b][c];
            }

    cublasSgemmBatched(mCublas, CUBLAS_OP_N, CUBLAS_OP_N,
                       M, L, K, onef,
                       reinterpret_cast<const float*>(transpose), M,
                       reinterpret_cast<const float*>(inputs[0]), K,
                       zerof, reinterpret_cast<const float*>(outputs[0]), M, N);

    // Transpose the output matrix
    for (int e = 0; e < N; ++e)
        for (int f = 0; f < O; ++f)
            for (int g = 0; g < M; ++g) {
                transpose_1[e][g][f] = outputs[0][e][f][g];
            }
}

Checking the documentation for cublasSgemmBatched, it says you don't have to transpose the data yourself; you just pass the cublasOperation_t CUBLAS_OP_T for the corresponding argument. This avoids wasting time copying and, more importantly, saves your memory for what really matters.

Furthermore, I see you are computing transpose(transpose(B) * A) in the second case. Since (X*Y)' = Y' * X', that is simply transpose(A) * B. Check Property 3 at Transpose Properties - Wikipedia.

I don't know if this will affect your use case, but in PyTorch I have observed that transposing the second parameter of bmm (and not the first) makes it faster.

(I am not compiling or running this, so take it as a sketch.)
I think you could simplify the code as follows.

ntg,ncg->nct = X2 * X1', where X2 is (L x K), X1 is (M x K), and the result is (L x M):

   cublasSgemmBatched(mCublas, CUBLAS_OP_N, CUBLAS_OP_T,
                      L, M, K, onef,
                      reinterpret_cast<const float*>(inputs[1]), K,
                      reinterpret_cast<const float*>(inputs[0]), K,
                      zerof, reinterpret_cast<const float*>(outputs[0]), M, N);

nct,ncp->ntp = X2' * X1, where X2 is (K x L), X1 is (K x M), and the result is (L x M):

   cublasSgemmBatched(mCublas, CUBLAS_OP_T, CUBLAS_OP_N,
                      L, M, K, onef,
                      reinterpret_cast<const float*>(inputs[1]), L,
                      reinterpret_cast<const float*>(inputs[0]), M,
                      zerof, reinterpret_cast<const float*>(outputs[0]), M, N);

I understand the leading dimension of a matrix to be the number of elements per row in this case.

Hi @alexandre.felipe,

Thank you very much for your kind response.

I think nct,ncp->ntp, with X2 (K x L) and X1 (K x M), has to result in (M x L).

Hi @alexandre.felipe,

I tried executing ntg,ncg->nct (in the same way you suggested), but I am getting an error during execution.

Could you share your notebook? (Google Colab is good for this.)

I must have missed something.

Hi @alexandre.felipe,

Thanks very much for your kind reply.
Sure, I will share the notebook.

Attaching Below:
Einsum_CUBLAS.ipynb (37.3 KB)

Thanks and Regards,
Darshan

Well, the problem is that the Python representation of the arrays is row-major while the cuBLAS representation is column-major.

This is equivalent to transposing all the matrices, and thus the order of the arguments is changed.

Here is how to compute each matrix product; each entry maps the desired product to (first operand, transa, second operand, transb) as passed to cuBLAS.

{
 'AB': ('B', 'n', 'A', 'n'),
 'AB^T': ('B', 't', 'A', 'n'),
 'A^TB': ('B', 'n', 'A', 't'),
 'A^TB^T': ('B', 't', 'A', 't'),
 'BA': ('A', 'n', 'B', 'n'),
 'BA^T': ('A', 't', 'B', 'n'),
 'B^TA': ('A', 'n', 'B', 't'),
 'B^TA^T': ('A', 't', 'B', 't')
}

The leading dimension is still the number of columns in the Python representation, which becomes the number of rows in the cuBLAS representation.
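For instance, here is a minimal C++ sketch (I am not compiling this; the handle, the device pointers dA/dB/dC, and the dimension names are mine) of the 'AB' entry above:

#include <cublas_v2.h>

// Row-major C = A * B (A is M x K, B is K x N) on column-major cuBLAS:
// compute C^T = B^T * A^T by passing B first and A second, both 'n',
// because the row-major buffers already look transposed to cuBLAS.
void rowMajorGemm(cublasHandle_t handle, const float* dA, const float* dB,
                  float* dC, int M, int N, int K)
{
    const float alpha = 1.0f, beta = 0.0f;
    // Each leading dimension is the row length of the row-major buffer.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                N, M, K, &alpha,
                dB, N,
                dA, K,
                &beta, dC, N);
}

The same operand swap carries over to the batched calls below.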

# ntg, ncg -> nct
cublas.cublasSgemmBatched(cublas_handle, 't','n', 
                                  m, n, k, alpha, 
                                  a_arr.gpudata, k, 
                                  b_arr.gpudata, k, 
                                  beta, c_arr.gpudata, m, l)
# nct, ncp -> ntp
cublas.cublasSgemmBatched(cublas_handle, 'n', 't',
                                  n, k, m, alpha, 
                                  b_arr.gpudata, n, 
                                  a_arr.gpudata, k, 
                                  beta, c_arr.gpudata, n, l)
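If you move back to the strided variant for the C++ plugin, the same two products might look like this (again a sketch, not compiled; batch/T/C/G/P are my own names for the einsum dimensions, and the buffers are assumed to be contiguous row-major device memory). Each stride argument is simply the number of elements between consecutive matrices in the batched buffer, i.e. rows times columns of one matrix:

#include <cublas_v2.h>

// ntg,ncg->nct: X1 is (batch, T, G), X2 is (batch, C, G), out is (batch, C, T).
// The column-major view of the row-major output (C x T) is its transpose
// (T x C), so compute out^T = X1 * X2^T per batch item.
void einsum_ntg_ncg_nct(cublasHandle_t handle, const float* dX1, const float* dX2,
                        float* dOut, int batch, int T, int C, int G)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemmStridedBatched(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                              T, C, G, &alpha,
                              dX1, G, (long long)T * G,   // stride = one X1 matrix
                              dX2, G, (long long)C * G,   // stride = one X2 matrix
                              &beta,
                              dOut, T, (long long)C * T,  // stride = one output matrix
                              batch);
}

// nct,ncp->ntp: X1 is (batch, C, T), X2 is (batch, C, P), out is (batch, T, P).
void einsum_nct_ncp_ntp(cublasHandle_t handle, const float* dX1, const float* dX2,
                        float* dOut, int batch, int C, int T, int P)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                              P, T, C, &alpha,
                              dX2, P, (long long)C * P,
                              dX1, T, (long long)C * T,
                              &beta,
                              dOut, P, (long long)T * P,
                              batch);
}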

Einsum_CUBLAS (1).ipynb (16.7 KB)

Hi @alexandre.felipe,

Thank you very much for all your help. Now I am able to get the correct results, and the plugin has been implemented. Thanks for your time, for sharing your knowledge, and for your kindness.

Solved!

Thanks and Regards,
Darshan

Your last post is private.

What is the type of your input? It is declared as const void* const*; if what you pass actually points to flat float data (a const float*), then it is natural to get the invalid memory error. You need to pass an array of pointers to cublasSgemmBatched; if you want to skip this step, you have to use the strided version.
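For reference, building that pointer array could look like this (a sketch with illustrative names; dA is assumed to be a flat device buffer holding `batch` packed matrices):

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// cublasSgemmBatched wants a *device* array of per-matrix device pointers,
// not the flat float buffer itself.
const float** makePointerArray(const float* dA, int batch, int rows, int cols)
{
    std::vector<const float*> hPtrs(batch);
    for (int i = 0; i < batch; ++i)
        hPtrs[i] = dA + static_cast<size_t>(i) * rows * cols;  // i-th matrix

    const float** dPtrs = nullptr;
    cudaMalloc(reinterpret_cast<void**>(&dPtrs), batch * sizeof(const float*));
    cudaMemcpy(dPtrs, hPtrs.data(), batch * sizeof(const float*),
               cudaMemcpyHostToDevice);
    return dPtrs;  // pass this as Aarray/Barray to cublasSgemmBatched
}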

Hi @alexandre.felipe,

Thank you for your kind reply.

Following your response, I am implementing it with reference to this: https://github.com/NVIDIA/TensorRT/blob/88f9cae8afb8df8f7da45e64b46af8b97e618798/samples/opensource/samplePlugin/fcPlugin.h#L170, where the inputs are declared as const void* const* and cast to const float*. I am doing the same thing, but I am not sure why there is an error.

The error is: Cuda failure: an illegal memory access was encountered

Can you please assist me to resolve this error?

Thanks,
Darshan

This is very particular to your problem and not so much about the NVIDIA modules, so maybe it is time to move to Stack Overflow. I would say: try to use your memory with a simpler function, for instance calculate the sum (on the GPU) of all the elements of your matrices using the same pointer. If that works, then I would say you are passing the wrong values for the sizes.
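For example (a sketch, not compiled, reusing mCublas, inputs, and the N/L/K dimension names from your earlier snippet), something as simple as cublasSasum exercises the same pointer:

// If this call also faults, the device pointer itself is bad; if it
// succeeds, the sizes passed to the gemm are the likely culprit.
float sum = 0.0f;
cublasStatus_t status = cublasSasum(mCublas, N * L * K,
                                    reinterpret_cast<const float*>(inputs[0]), 1,
                                    &sum);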

Hi @alexandre.felipe,

Thank you very much once again! I was able to resolve the issue by using cublasSgemmStridedBatched.

One last doubt is about the stride parameter: how does it matter? Actually, I had given 1 unknowingly and it worked, but I am not sure how to tweak it in my case (for ntg,ncg->nct and nct,ncp->ntp).
Could you please help me understand that? It would be a great help.

Also, I have placed the parameters in the same positions in cublasSgemmStridedBatched() as I placed them in cublasSgemmBatched(), and I have assigned the stride value 1.

Thanks,
Darshan
