Batched CUBLAS Questions

Hello all,

Currently I have a collection of vectors and matrices that I process in a loop, one CUBLAS SGEMV call per matrix-vector pair. I would like to use batched CUBLAS calls for this.

Can batched CUBLAS calls be employed for SGEMV? If not, does anyone have any suggestions for converting the code so that it can employ batched CUBLAS, perhaps by translating the SGEMV to SGEMM?

A sample of the code I am using is copied below. Thank you to anyone with some idea(s).

void batchingCalls(int P, int K,
                             std::vector<std::vector<float> > &A,
                             std::vector<std::vector<float> > &b,
                             std::vector<std::vector<float> > &c,
                             std::vector<float> &alpha)  {

    float *d_A, *d_b, *d_c;
    cublasHandle_t handle;
    int size = P*K;
    int cnt = A.size();
    cublasCreate(&handle);

    for(int i = 0; i < cnt; ++i) {
        cudaMalloc((void**)&d_A,  size*sizeof(float));
        cudaMalloc((void**)&d_b, size*sizeof(float));
        cudaMalloc((void**)&d_c,  K*sizeof(float));

        cudaMemcpy(d_A,  A[i].data(), size*sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b[i].data(), size*sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_c, c[i].data(), K*sizeof(float), cudaMemcpyHostToDevice);

        float beta = 1.0f;

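        /* call cublas SGEMV - can this be translated to SGEMM to employ batched CUBLAS ?? */
        /* computes d_c = alpha[i] * A^T * d_b + beta * d_c; with CUBLAS_OP_T only the first P elements of d_b are read */
        cublasSgemv(handle, CUBLAS_OP_T, P, K, &alpha[i], d_A, P, d_b, 1, &beta, d_c, 1);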
    
        cudaMemcpy(c[i].data(), d_c, K*sizeof(float), cudaMemcpyDeviceToHost);
    
        cudaFree(d_A);
        cudaFree(d_b);
        cudaFree(d_c);
    }

    cublasDestroy(handle);

}

If the end result you are after is concurrency of the Sgemv() calls, I showed a good example with nice overlap in this recent thread:

https://devtalk.nvidia.com/default/topic/838794/gpu-accelerated-libraries/using-cublas-in-different-cuda-streams/

In that case I used streams, and the distinct Sgemv() calls were not serialized; they mostly overlapped in execution.

Look at my response at the end of that thread. I link to real-world examples for both cuBLAS and cuSparse.

Thanks for the information, great example on concurrency of SGEMV. However, I am really interested in employing batched CUBLAS. I want to find a way to convert my code to use batched CUBLAS calls, if possible. I am curious as to how this would work performance-wise.

Thank you again.

Certainly if you have the same matrix that you are multiplying by a number of vectors using gemv, then it’s straightforward to convert that to a single gemm operation. If you then have groups of vectors where each group is multiplied by a single matrix, then each group could be converted to a gemm operation, and the set of all gemm operations could be passed as a batched gemm operation.
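
To make the first point concrete, here is a minimal CPU-only sketch (plain C++, no GPU needed) of why several gemv calls that share one matrix collapse into a single gemm: the vectors simply become the columns of the second matrix. The helper names gemvT and gemmTN are hypothetical and exist only for this illustration; the storage is column-major with A being P x K, matching the CUBLAS_OP_T orientation of the code posted above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference y = A^T * x, where A is a column-major P x K matrix (lda = P).
// x has P elements, y has K elements.
std::vector<float> gemvT(int P, int K,
                         const std::vector<float>& A,
                         const std::vector<float>& x) {
    assert(A.size() == std::size_t(P) * K && x.size() == std::size_t(P));
    std::vector<float> y(K, 0.0f);
    for (int k = 0; k < K; ++k)
        for (int p = 0; p < P; ++p)
            y[k] += A[p + std::size_t(k) * P] * x[p];
    return y;
}

// Reference C = A^T * X, where X is column-major P x N (one input vector
// per column) and C is column-major K x N (one result vector per column).
std::vector<float> gemmTN(int P, int K, int N,
                          const std::vector<float>& A,
                          const std::vector<float>& X) {
    assert(A.size() == std::size_t(P) * K && X.size() == std::size_t(P) * N);
    std::vector<float> C(std::size_t(K) * N, 0.0f);
    for (int n = 0; n < N; ++n)
        for (int k = 0; k < K; ++k)
            for (int p = 0; p < P; ++p)
                C[k + std::size_t(n) * K] +=
                    A[p + std::size_t(k) * P] * X[p + std::size_t(n) * P];
    return C;
}
```

Column n of the gemmTN result equals gemvT applied to column n of X. On the GPU, the matching call would presumably be cublasSgemm with CUBLAS_OP_T, CUBLAS_OP_N, m = K, n = N, k = P, lda = P, ldb = P, ldc = K, assuming the vectors have been packed contiguously into one device buffer.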

I believe that nothing prevents you from calling a gemm operation even if one of your “matrices” is a vector (an nx1 matrix). So I’m also unaware of any limitation that would prevent you from using batched gemm operation to perform a set of gemv operations. But I don’t know about the performance implications either.
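
As a rough illustration of that idea, the sketch below models on the CPU what a batched gemm with n = 1 computes, using the same argument order (m, n, k, alpha, A pointers, lda, B pointers, ldb, beta, C pointers, ldc) as cublasSgemmBatched. The functions sgemmBatchedRefTN and gemvBatchViaGemm are hypothetical stand-ins written for this post, not cuBLAS functions, and nothing here has been run on a GPU. One difference from the loop posted above: cublasSgemmBatched takes a single alpha and a single beta for the whole batch, so a per-problem alpha[i] would have to be folded in some other way (for example by pre-scaling each vector).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU stand-in for a batched GEMM restricted to the transa = T, transb = N case:
//   C_i = alpha * A_i^T * B_i + beta * C_i   (column-major storage)
// A_i is stored k x m (lda >= k), B_i is k x n (ldb >= k), C_i is m x n (ldc >= m).
void sgemmBatchedRefTN(int m, int n, int k, float alpha,
                       const std::vector<const float*>& A, int lda,
                       const std::vector<const float*>& B, int ldb,
                       float beta,
                       const std::vector<float*>& C, int ldc) {
    assert(A.size() == B.size() && B.size() == C.size());
    assert(lda >= k && ldb >= k && ldc >= m);
    for (std::size_t i = 0; i < A.size(); ++i)
        for (int col = 0; col < n; ++col)
            for (int row = 0; row < m; ++row) {
                float acc = 0.0f;
                for (int p = 0; p < k; ++p)
                    acc += A[i][p + std::size_t(row) * lda] *
                           B[i][p + std::size_t(col) * ldb];
                float& c = C[i][row + std::size_t(col) * ldc];
                c = alpha * acc + beta * c;
            }
}

// A batch of GEMVs, y_i = alpha * A_i^T * x_i + beta * y_i, expressed as
// GEMMs whose second operand is a P x 1 matrix (n = 1).
// Each A_i is column-major P x K, each x_i has P elements, each y_i has K.
void gemvBatchViaGemm(int P, int K, float alpha,
                      const std::vector<const float*>& A,
                      const std::vector<const float*>& x,
                      float beta,
                      const std::vector<float*>& y) {
    sgemmBatchedRefTN(/*m=*/K, /*n=*/1, /*k=*/P, alpha, A, /*lda=*/P,
                      x, /*ldb=*/P, beta, y, /*ldc=*/K);
}
```

If this mapping carries over, the GPU version would be one call of the form cublasSgemmBatched(handle, CUBLAS_OP_T, CUBLAS_OP_N, K, 1, P, &alpha, d_Aarray, P, d_xarray, P, &beta, d_yarray, K, batchCount), where d_Aarray, d_xarray and d_yarray are assumed names for device arrays holding the device pointers of each problem. All problems in the batch must share the same P and K, which is already the case in the posted loop. Whether this beats the loop of Sgemv calls is something only a benchmark can answer.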

Perhaps you should just try it.

You’re right. I’ll give it a try and see what happens.