Hundreds of parallel matrix-vector multiplications with cuBLAS

chung · October 13, 2014, 10:36pm

To my understanding a cuBLAD handle (cublasHandle_t) is a wrapper for a cuda stream (cudaStream_t); this said, if we attempt to launch hundreds of parallel matrix-vector multiplications using cuBLAS and constructing the corresponding handles, these will be serialized in groups of 16, i.e., only 16 parallel cuBLAS kernels will actually execute in parallel. So, if this is true, is there a way (possibly using some other library) to perform lots of (actually) parallel matrix-vector multiplications?

Note: If the multiplications we need to perform in parallel are in the form Ax1, Ax2, …, AxN, then we can simply pack x1, x2, …, xN is a matrix X = [x1, x2, …, xN] and perform the multiplication A * X. If we need to do A1x1, A2x2, …, ANxN then the only solution I can think of is to do A*X with A = blockdiag(A1, A2, …, AN), but this would occupy tremendous amounts of memory!

I’m also quoting the following from the CUDA C Programming guide:

UPDATE…

I think there are the following ways to do lots of matrix-vector multiplications in parallel:

Use cuBLAS with different streams (or in different CUDA contexts)
Write a custom kernel - but achieving high efficiency is non-trivial
Use Array Fire Pro (and pay for it)
Maybe Thrust could be useful? The zip iterator maybe?

I would appreciate some suggestions…

Robert_Crovella · October 14, 2014, 11:21pm

You don’t need a cublas handle for each stream. It is not a wrapper for a cuda stream. Not even sure what you mean by that.

A matrix multiplication of any reasonable size can fully occupy the GPU. Even if it did not, several run in parallel should fully occupy the GPU. There is no benefit to trying to run more than 16 CUBLAS streams in parallel. Once the machine is fully utilized, there is no benefit to trying to create additional concurrency.

chung · October 15, 2014, 8:57am

I believe one may write their own kernel to perform multiple matrix-vector multiplications (on a single stream) and that’s what I’m currently working on. My question is whether there is some nice way to perform hundreds of parallel matrix-vector multiplications using cuBLAS, given of course that they don’t occupy the whole device. I guess cuBlasSetStream is not the right way to go…

philippev · October 30, 2014, 5:50am

If all your matrices Ai , Bi , Ci are the same sizes for all i, then you can use cublasgemmBatched
On Kepler, on DP, if sizes > 128 it is usually better to use cublasDgemm in multiple streams

758370661 · April 7, 2021, 11:29am

Hello, I am trying to utilize streams to parallely calculate cublasSdgmm and SVD decomposition in cuSolver, but I found there are no benefit using more stream. Could you help me with this?
The codes are like this:

void cudamalloc_pointer_of_pointer(float **&device_data, float**&host_data, float** src_data, int batchSize,
                                   size_t arraySize) {
  float **tem_data = (float **)malloc(batchSize * sizeof(float *));
  for (int i = 0; i < batchSize; i++) {
    CUDA_CHECK(cudaMalloc((void **)&tem_data[i], arraySize));
    CUDA_CHECK(cudaMemcpy(tem_data[i], src_data[i], arraySize,
                          cudaMemcpyHostToDevice));
  }
  CUDA_CHECK(cudaMalloc((void **)&device_data, batchSize * sizeof(float *)));
  CUDA_CHECK(cudaMemcpy(device_data, tem_data, batchSize * sizeof(float *),
                        cudaMemcpyHostToDevice));
  //free(tem_data);
  host_data = tem_data;
}

void cudamalloc_pointer_of_pointer(float **&device_data, float**&host_data,
                                   int batchSize, size_t arraySize) {
  host_data = (float **)malloc(batchSize * sizeof(float *));
  for (int i = 0; i < batchSize; i++) {
    CUDA_CHECK(cudaMalloc((void **)&(host_data[i]), arraySize));
    CUDA_CHECK(cudaMemset(host_data[i], 0, arraySize));
  }
  CUDA_CHECK(cudaDeviceSynchronize());
  CUDA_CHECK(cudaMalloc((void **)&device_data, batchSize * sizeof(float *)));
  CUDA_CHECK(cudaMemcpy(device_data, host_data, batchSize * sizeof(float *),
                        cudaMemcpyHostToDevice));
}

void batched_dgmm(cublasSideMode_t mode, int batchSize, int m, int n, float** hostA, float** hostx, float** hostC) {
  cudaStream_t *streamArray = 0;
  streamArray = (cudaStream_t*)malloc(batchSize * sizeof(cudaStream_t));
  for (int i = 0; i < batchSize; i++)
    CUDA_CHECK(
        cudaStreamCreateWithFlags(&streamArray[i], cudaStreamNonBlocking));
  cublasHandle_t handle;
  CUBLAS_CHECK(cublasCreate(&handle));
  for(int i=0;i<batchSize;i++) {
    cublasSetStream(handle, streamArray[i]);
    cublasSdgmm(handle, mode, m, n, hostA[i], m, hostx[i], n, hostC[i], m);
  }
  CUDA_CHECK(cudaDeviceSynchronize());
  cublasDestroy(handle);
  for (int i = 0; i < batchSize; i++)
    CUDA_CHECK(cudaStreamDestroy(streamArray[i]));
  free(streamArray);
}

void test_batched_dgmm() {
  clock_t start_time, end_time;
  float **a = NULL, **b = NULL;
  float **hosta = NULL, **hostb = NULL;
  float **c = NULL, **hostc = NULL;
  int batchSize = 1000;
  int m = 58;
  int eigenSize = 3;
  cudamalloc_pointer_of_pointer(a, hosta, batchSize, m * eigenSize * sizeof(float));
  cudamalloc_pointer_of_pointer(b, hostb, batchSize, eigenSize * eigenSize * sizeof(float));

  cudamalloc_pointer_of_pointer(c, hostc, batchSize,
                                m * eigenSize * sizeof(float));
  int count = 0;
  while (count < 10) {
    start_time = clock();
    batched_dgmm(CUBLAS_SIDE_RIGHT, batchSize, m, eigenSize, hosta, hostb,
                 hostc);
    end_time = clock();
    printf("%d th: time:%f\n", count, (double)(end_time - start_time) / CLOCKS_PER_SEC);
  
  }
}

and I found whether I remove this line of code “cublasSetStream(handle, streamArray[i]);” or not, there is no difference in the time used.
I use Visual Studio 2019 to build the code, and my GPU is RTX 3090.

758370661 · April 7, 2021, 11:35am

Could you help me with that too, if convenient? I think I had followed the way described in 7_CUDALibraries\batchCUBLAS\batchCUBLAS.cpp… But there seems to be no improvement. The code and details are just in the post above.

mnicely · April 7, 2021, 1:01pm

You will not see any performance benefits from streams if you already have a single stream utilizing all available resources.

758370661 · April 8, 2021, 5:28am

But while I do the batched_dgmm, I am not doing any other calculation. So you mean a single cublasSdgmm call will consume all the computation resources?

mnicely · April 8, 2021, 12:26pm

You’ll need to confirm that with Nsight Systems on your particular use case.

Topic		Replies	Views
cublasZgemmBatched low performance 2x2 matrices; how to increase performance? GPU-Accelerated Libraries	9	1299	February 20, 2015
cusparse concurrency using streams CUDA Programming and Performance	3	1717	July 19, 2013
CUBLAS matrix-vector multiplication CUDA Programming and Performance	14	10034	January 20, 2010
Use streams to batch the execution of CUBLAS kernels. CUDA Programming and Performance	3	2189	June 15, 2011
Concurrent kernels on Kepler CUDA Programming and Performance	8	1055	February 23, 2014
Does CUBLAS 4 RC-2 support using multiple contexts from a single host-thread? CUDA Programming and Performance	11	10619	August 19, 2011
cuBLAS handle creation fails CUDA Programming and Performance	1	504	June 13, 2022
Maximum matrix size for matrix multiplication operation on GeForce GTX 960M CUDA Programming and Performance	12	3671	November 28, 2018
Strange Variations in Execution Time of cublas<t>geam() [cublasDgemm] GPU-Accelerated Libraries cublas	9	955	September 2, 2021
simple matrix (or matrix vector) multiplication using CUBLAS CUDA Programming and Performance	9	5590	November 25, 2009

Hundreds of parallel matrix-vector multiplications with cuBLAS

Related topics