Hundreds of parallel matrix-vector multiplications with cuBLAS

To my understanding, a cuBLAS handle (cublasHandle_t) is associated with a CUDA stream (cudaStream_t). That said, if we attempt to launch hundreds of parallel matrix-vector multiplications using cuBLAS, constructing a handle for each, they will be serialized in groups of 16, i.e., only 16 cuBLAS kernels will actually execute concurrently. If this is true, is there a way (possibly using some other library) to perform lots of (actually) parallel matrix-vector multiplications?

Note: If the multiplications we need to perform in parallel are of the form Ax1, Ax2, …, AxN, then we can simply pack x1, x2, …, xN into a matrix X = [x1, x2, …, xN] and perform the single multiplication A * X. If we need to do A1x1, A2x2, …, ANxN, then the only solution I can think of is A * X with A = blockdiag(A1, A2, …, AN), but that would occupy a tremendous amount of memory!

I'm also quoting the following from the CUDA C Programming Guide:

I think there are the following ways to do lots of matrix-vector multiplications in parallel:

  1. Use cuBLAS with different streams (or in different CUDA contexts)
  2. Write a custom kernel - but achieving high efficiency is non-trivial
  3. Use ArrayFire Pro (and pay for it)
  4. Maybe Thrust could be useful? The zip iterator, perhaps?

I would appreciate some suggestions…

You don't need a cuBLAS handle for each stream, and a handle is not a wrapper for a CUDA stream. I'm not even sure what you mean by that.

A matrix multiplication of any reasonable size can fully occupy the GPU. Even if one did not, several running in parallel should fully occupy it. There is no benefit to trying to run more than 16 cuBLAS kernels in parallel: once the machine is fully utilized, creating additional concurrency gains nothing.

I believe one can write a custom kernel to perform multiple matrix-vector multiplications (on a single stream), and that's what I'm currently working on. My question is whether there is some nice way to perform hundreds of parallel matrix-vector multiplications using cuBLAS, given of course that they don't occupy the whole device. I guess cublasSetStream is not the right way to go…

If all your matrices Ai, Bi, Ci have the same dimensions for all i, then you can use cublas<t>gemmBatched.
On Kepler, in double precision, if the sizes are > 128 it is usually better to use cublasDgemm in multiple streams.

Hello, I am trying to use streams to run cublasSdgmm and SVD decompositions (cuSOLVER) in parallel, but I found there is no benefit to using more streams. Could you help me with this?
The code looks like this:

void cudamalloc_pointer_of_pointer(float **&device_data, float **&host_data,
                                   float **src_data, int batchSize,
                                   size_t arraySize) {
  float **tem_data = (float **)malloc(batchSize * sizeof(float *));
  for (int i = 0; i < batchSize; i++) {
    CUDA_CHECK(cudaMalloc((void **)&tem_data[i], arraySize));
    CUDA_CHECK(cudaMemcpy(tem_data[i], src_data[i], arraySize,
                          cudaMemcpyHostToDevice));
  }
  CUDA_CHECK(cudaMalloc((void **)&device_data, batchSize * sizeof(float *)));
  CUDA_CHECK(cudaMemcpy(device_data, tem_data, batchSize * sizeof(float *),
                        cudaMemcpyHostToDevice));
  host_data = tem_data;
}

void cudamalloc_pointer_of_pointer(float **&device_data, float **&host_data,
                                   int batchSize, size_t arraySize) {
  host_data = (float **)malloc(batchSize * sizeof(float *));
  for (int i = 0; i < batchSize; i++) {
    CUDA_CHECK(cudaMalloc((void **)&(host_data[i]), arraySize));
    CUDA_CHECK(cudaMemset(host_data[i], 0, arraySize));
  }
  CUDA_CHECK(cudaMalloc((void **)&device_data, batchSize * sizeof(float *)));
  CUDA_CHECK(cudaMemcpy(device_data, host_data, batchSize * sizeof(float *),
                        cudaMemcpyHostToDevice));
}

void batched_dgmm(cublasSideMode_t mode, int batchSize, int m, int n,
                  float **hostA, float **hostx, float **hostC) {
  cudaStream_t *streamArray =
      (cudaStream_t *)malloc(batchSize * sizeof(cudaStream_t));
  for (int i = 0; i < batchSize; i++)
    CUDA_CHECK(cudaStreamCreateWithFlags(&streamArray[i], cudaStreamNonBlocking));
  cublasHandle_t handle;
  cublasCreate(&handle);
  for (int i = 0; i < batchSize; i++) {
    cublasSetStream(handle, streamArray[i]);
    cublasSdgmm(handle, mode, m, n, hostA[i], m, hostx[i], n, hostC[i], m);
  }
  for (int i = 0; i < batchSize; i++)
    cudaStreamDestroy(streamArray[i]);
  cublasDestroy(handle);
  free(streamArray);
}

void test_batched_dgmm() {
  clock_t start_time, end_time;
  float **a = NULL, **b = NULL;
  float **hosta = NULL, **hostb = NULL;
  float **c = NULL, **hostc = NULL;
  int batchSize = 1000;
  int m = 58;
  int eigenSize = 3;
  cudamalloc_pointer_of_pointer(a, hosta, batchSize,
                                m * eigenSize * sizeof(float));
  cudamalloc_pointer_of_pointer(b, hostb, batchSize,
                                eigenSize * eigenSize * sizeof(float));
  cudamalloc_pointer_of_pointer(c, hostc, batchSize,
                                m * eigenSize * sizeof(float));
  int count = 0;
  while (count < 10) {
    start_time = clock();
    batched_dgmm(CUBLAS_SIDE_RIGHT, batchSize, m, eigenSize, hosta, hostb,
                 hostc);
    // Kernel launches are asynchronous: synchronize before reading the
    // clock, otherwise only the launch overhead is measured.
    cudaDeviceSynchronize();
    end_time = clock();
    printf("%d th: time:%f\n", count,
           (double)(end_time - start_time) / CLOCKS_PER_SEC);
    count++;
  }
}

and I found that whether or not I remove the line “cublasSetStream(handle, streamArray[i]);”, there is no difference in the time taken.
I use Visual Studio 2019 to build the code, and my GPU is an RTX 3090.

Could you help me with that too, if convenient? I think I followed the approach described in 7_CUDALibraries\batchCUBLAS\batchCUBLAS.cpp, but there seems to be no improvement. The code and details are in the post above.

You will not see any performance benefits from streams if you already have a single stream utilizing all available resources.

But while I run batched_dgmm, I am not doing any other computation. So you mean a single cublasSdgmm call will consume all of the compute resources?

You’ll need to confirm that with Nsight Systems on your particular use case.