Batch Matrix Multiplication using cuBLAS

Hi NVIDIA team,

I am working on registering a plugin for an operator (Einsum) that is not currently supported in TensorRT. Instead of implementing a CUDA kernel myself, I want to use the cuBLAS library for batch matrix multiplication.

The equations I want to implement (from the Einsum operator) are:
"ntg, ncg -> nct" and "nct, ncp -> ntp" (both batch matrix multiplications)

Info about Einsum op: https://github.com/onnx/onnx/blob/master/docs/Operators.md#Einsum

I need guidance on using the cuBLAS library for batched matrix multiplication for the above two equations.

I am referring to the available references (cuBLAS :: CUDA Toolkit Documentation, https://developer.nvidia.com/blog/cublas-strided-batched-matrix-multiply/), but I don't understand how to apply them to the above equations.

Can you please assist me with this?

Thanks in advance,
Darshan C G

Looking forward to your reply.

Thanks

Hi @darshancganji12, I am not from the NVIDIA team, but it seems we are interested in solving the same problem and have complementary backgrounds.

Check this issue, where I show an implementation of batch matrix multiplication using gemmStridedBatched:
Batched matrix multiplication copying the input data (CUDA) · Issue #52111 · pytorch/pytorch (github.com)

Your ntg,ncg->nct is X2 * X1', and your nct,ncp->ntp is X1' * X2 (see the checks below).

Thus what you need to do is:

For ntg,ncg->nct, use A=X2 and B=X1 in gemmStridedBatched, and pass transA=false, transB=true.

For nct,ncp->ntp, use A=X1 and B=X2 in gemmStridedBatched, and pass transA=true, transB=false.

You can verify these claims with:

import torch

X1 = torch.randn(10, 10, 10)
X2 = torch.randn(10, 10, 10)
R1 = torch.einsum('ntg, ncg -> nct', X1, X2)
R2 = torch.einsum('nct, ncp -> ntp', X1, X2)
# Both einsums reduce to batched matrix products with one operand transposed.
assert torch.allclose(R1, torch.bmm(X2, X1.transpose(1, 2)))
assert torch.allclose(R2, torch.bmm(X1.transpose(1, 2), X2))

Hi @alexandre.felipe,

Thank you very much for your kind response. This is a great help from your side.

I will implement it using gemmStridedBatched, following your reference.

Thanks,
Darshan

Hi @alexandre.felipe,

Will the same approach not work with cublasSgemmBatched?

Thanks

I have implemented the same using cublasSgemmBatched, but I want to remove the loops used for the transposes.
if (equation == "nct,ncp->ntp")
{
    // Transpose matrix A (buffers indexed here as 3-D arrays for clarity)
    for (int k = 0; k < N; ++k)
        for (int i = 0; i < L; ++i)
            for (int j = 0; j < K; ++j) {
                transpose[k][j][i] = inputs[0][k][i][j];
            }

    cublasSgemmBatched(mCublas, CUBLAS_OP_N, CUBLAS_OP_N,
                       M, L, K, onef,
                       reinterpret_cast<const float*>(inputs[1]), M,
                       reinterpret_cast<const float*>(transpose), K,
                       zerof, reinterpret_cast<const float*>(outputs[0]), M, N);
}
else if (equation == "ntg,ncg->nct")
{
    // Transpose matrix B
    for (int a = 0; a < N; ++a)
        for (int b = 0; b < O; ++b)
            for (int c = 0; c < M; ++c) {
                transpose[a][c][b] = inputs[1][a][b][c];
            }

    cublasSgemmBatched(mCublas, CUBLAS_OP_N, CUBLAS_OP_N,
                       M, L, K, onef,
                       reinterpret_cast<const float*>(transpose), M,
                       reinterpret_cast<const float*>(inputs[0]), K,
                       zerof, reinterpret_cast<const float*>(outputs[0]), M, N);

    // Transpose the output matrix
    for (int e = 0; e < N; ++e)
        for (int f = 0; f < O; ++f)
            for (int g = 0; g < M; ++g) {
                transpose_1[e][g][f] = outputs[0][e][f][g];
            }
}

Checking the documentation for cublasSgemmBatched, it says you don't have to transpose the data yourself; you just pass the cublasOperation_t CUBLAS_OP_T for the corresponding argument. This avoids wasting time copying and, more importantly, saves your memory for what really matters.

Furthermore, I see you are computing transpose(transpose(B) * A) in the second case. Since (X*Y)' = Y' * X', that is simply transpose(A) * B. Check Property 3 at Transpose Properties - Wikipedia.

I don't know if this will affect your use case, but in PyTorch I have observed that transposing the second parameter of bmm (and not the first) makes it faster.

(I am not compiling or running this, so take it as a sketch.)
I think you could simplify the code as follows.

ntg,ncg->nct = X2 * X1', where X2 is (L x K), X1 is (M x K), and the result is (L x M):

   cublasSgemmBatched(mCublas, CUBLAS_OP_N, CUBLAS_OP_T,
                      L, M, K, onef,
                      reinterpret_cast<const float*>(inputs[1]), K,
                      reinterpret_cast<const float*>(inputs[0]), K,
                      zerof, reinterpret_cast<const float*>(outputs[0]), M, N);

nct,ncp->ntp = X2' * X1, where X2 is (K x L), X1 is (K x M), and the result is (L x M):

   cublasSgemmBatched(mCublas, CUBLAS_OP_T, CUBLAS_OP_N,
                      L, M, K, onef,
                      reinterpret_cast<const float*>(inputs[1]), L,
                      reinterpret_cast<const float*>(inputs[0]), M,
                      zerof, reinterpret_cast<const float*>(outputs[0]), M, N);

I understand the leading dimension of a matrix to be the number of elements per row in this case.

Hi @alexandre.felipe,

Thank you very much for your kind response.

I think nct,ncp->ntp, with X2 (K x L) and X1 (K x M), has to result in (M x L).

Hi @alexandre.felipe,

I tried executing ntg,ncg->nct (in the same way you suggested), but I am getting an error during execution.

Could you share your notebook? (Google Colab is good for this.)

I must have missed something.

Hi @alexandre.felipe,

Thanks very much for your kind reply.
Sure, I will share the notebook.

Attaching Below:
Einsum_CUBLAS.ipynb (37.3 KB)

Thanks and Regards,
Darshan

Well, the problem is that the Python representation of the arrays is row-major while the cuBLAS representation is column-major.

This is equivalent to transposing all the matrices, and thus the order of the arguments is changed.

Here is how to compute each matrix product; each entry maps the desired product to (first operand, transa, second operand, transb) as passed to cuBLAS.

{
 'AB': ('B', 'n', 'A', 'n'),
 'AB^T': ('B', 't', 'A', 'n'),
 'A^TB': ('B', 'n', 'A', 't'),
 'A^TB^T': ('B', 't', 'A', 't'),
 'BA': ('A', 'n', 'B', 'n'),
 'BA^T': ('A', 't', 'B', 'n'),
 'B^TA': ('A', 'n', 'B', 't'),
 'B^TA^T': ('A', 't', 'B', 't')
}

The leading dimension is still the number of columns in the Python representation, which becomes the number of rows in the cuBLAS representation.
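For instance, here is a minimal C++ sketch (I am not compiling this; the handle, the device pointers dA/dB/dC, and the dimension names are mine) of the 'AB' entry above:

#include <cublas_v2.h>

// Row-major C = A * B (A is M x K, B is K x N) on column-major cuBLAS:
// compute C^T = B^T * A^T by passing B first and A second, both 'n',
// because the row-major buffers already look transposed to cuBLAS.
void rowMajorGemm(cublasHandle_t handle, const float* dA, const float* dB,
                  float* dC, int M, int N, int K)
{
    const float alpha = 1.0f, beta = 0.0f;
    // Each leading dimension is the row length of the row-major buffer.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                N, M, K, &alpha,
                dB, N,
                dA, K,
                &beta, dC, N);
}

The same operand swap carries over to the batched calls below.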

# ntg, ncg -> nct
cublas.cublasSgemmBatched(cublas_handle, 't','n', 
                                  m, n, k, alpha, 
                                  a_arr.gpudata, k, 
                                  b_arr.gpudata, k, 
                                  beta, c_arr.gpudata, m, l)
# nct, ncp -> ntp
cublas.cublasSgemmBatched(cublas_handle, 'n', 't',
                                  n, k, m, alpha, 
                                  b_arr.gpudata, n, 
                                  a_arr.gpudata, k, 
                                  beta, c_arr.gpudata, n, l)
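If you move back to the strided variant for the C++ plugin, the same two products might look like this (again a sketch, not compiled; batch/T/C/G/P are my own names for the einsum dimensions, and the buffers are assumed to be contiguous row-major device memory). Each stride argument is simply the number of elements between consecutive matrices in the batched buffer, i.e. rows times columns of one matrix:

#include <cublas_v2.h>

// ntg,ncg->nct: X1 is (batch, T, G), X2 is (batch, C, G), out is (batch, C, T).
// The column-major view of the row-major output (C x T) is its transpose
// (T x C), so compute out^T = X1 * X2^T per batch item.
void einsum_ntg_ncg_nct(cublasHandle_t handle, const float* dX1, const float* dX2,
                        float* dOut, int batch, int T, int C, int G)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemmStridedBatched(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                              T, C, G, &alpha,
                              dX1, G, (long long)T * G,   // stride = one X1 matrix
                              dX2, G, (long long)C * G,   // stride = one X2 matrix
                              &beta,
                              dOut, T, (long long)C * T,  // stride = one output matrix
                              batch);
}

// nct,ncp->ntp: X1 is (batch, C, T), X2 is (batch, C, P), out is (batch, T, P).
void einsum_nct_ncp_ntp(cublasHandle_t handle, const float* dX1, const float* dX2,
                        float* dOut, int batch, int C, int T, int P)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                              P, T, C, &alpha,
                              dX2, P, (long long)C * P,
                              dX1, T, (long long)C * T,
                              &beta,
                              dOut, P, (long long)T * P,
                              batch);
}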

Einsum_CUBLAS (1).ipynb (16.7 KB)

Hi @alexandre.felipe,

Thank you very much for all your help. Now I am able to get the correct results, and the plugin has been implemented. Thanks for your time, for sharing your knowledge, and for your kindness.

Solved!

Thanks and Regards,
Darshan

Your last post is private.

What is the type of your input? It is declared as const void* const*; if what you pass actually points to flat float data (a const float*), then it is natural to get the invalid memory error. You need to pass an array of pointers to cublasSgemmBatched; if you want to skip this step, you have to use the strided version.
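For reference, building that pointer array could look like this (a sketch with illustrative names; dA is assumed to be a flat device buffer holding `batch` packed matrices):

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// cublasSgemmBatched wants a *device* array of per-matrix device pointers,
// not the flat float buffer itself.
const float** makePointerArray(const float* dA, int batch, int rows, int cols)
{
    std::vector<const float*> hPtrs(batch);
    for (int i = 0; i < batch; ++i)
        hPtrs[i] = dA + static_cast<size_t>(i) * rows * cols;  // i-th matrix

    const float** dPtrs = nullptr;
    cudaMalloc(reinterpret_cast<void**>(&dPtrs), batch * sizeof(const float*));
    cudaMemcpy(dPtrs, hPtrs.data(), batch * sizeof(const float*),
               cudaMemcpyHostToDevice);
    return dPtrs;  // pass this as Aarray/Barray to cublasSgemmBatched
}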

Hi @alexandre.felipe,

Thank you for your kind reply.

Following your response, I am implementing it with reference to this: https://github.com/NVIDIA/TensorRT/blob/88f9cae8afb8df8f7da45e64b46af8b97e618798/samples/opensource/samplePlugin/fcPlugin.h#L170, where the inputs are declared as const void* const* and cast to const float*. I am doing the same thing, but I am not sure why there is an error.

The error is: Cuda failure: an illegal memory access was encountered

Can you please assist me to resolve this error?

Thanks,
Darshan

This is very particular to your problem and not so much about the NVIDIA modules, so maybe it is time to move to Stack Overflow. I would say: try to use your memory with a simpler function, for instance calculate the sum (on the GPU) of all the elements of your matrices using the same pointer. If that works, then I would say you are passing the wrong values for the sizes.
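For example (a sketch, not compiled, reusing mCublas, inputs, and the N/L/K dimension names from your earlier snippet), something as simple as cublasSasum exercises the same pointer:

// If this call also faults, the device pointer itself is bad; if it
// succeeds, the sizes passed to the gemm are the likely culprit.
float sum = 0.0f;
cublasStatus_t status = cublasSasum(mCublas, N * L * K,
                                    reinterpret_cast<const float*>(inputs[0]), 1,
                                    &sum);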

Hi @alexandre.felipe,

Thank you very much once again! I was able to resolve the issue by using cublasSgemmStridedBatched.

One last doubt is about the stride parameter: how does it matter? Actually, I had given 1 unknowingly and it worked, but I am not sure how to tweak it in my case (for ntg,ncg->nct and nct,ncp->ntp).
Could you please help me understand that? It would be a great help.

Also, I have placed the parameters in the same positions in cublasSgemmStridedBatched() as I placed them in cublasSgemmBatched(), and I have assigned the stride value 1.

Thanks,
Darshan
