Using CUDA Streams and cublasDsyrk to replace cublasDgemmBatched call results in very low performance


We have to multiply many small matrices with themselves. Currently we are using cublasDgemmBatched, which gives us reasonable performance.
However, because each matrix is multiplied with itself, we want to improve the performance using cublasDsyrk.

We tried to write our own batchedDsyrk using CUDA streams and the cublasDsyrk function, but we observed extremely low performance with this approach.
With the cublasDgemmBatched version the program takes several seconds to finish; with our custom batchedDsyrk it suddenly needs several minutes.
We think we are doing something wrong, but we are not able to find our mistake. The matrices have a size of 32x16 (the output matrices are square, of size s1). We are using a P100 GPU.

The cublasDgemmBatched call looks like this:

cublasDgemmBatched(cublas_handle, CUBLAS_OP_T, CUBLAS_OP_N, s1, s1, batch_size,
                                   &bgemm_alpha[0], (const double**) d_map, vec_size,
                                                    (const double**) d_map, vec_size,
                                   &bgemm_beta[0] ,                  d_mcp, s1, nm[0]);

Our custom batchedDsyrk looks like this (the streams are created only once, before the main loop, and then reused several times):

const int stream_n = 128;
cudaStream_t *streams = (cudaStream_t*)malloc(stream_n*sizeof(cudaStream_t));
for(int i=0;i<stream_n;i++)
    CHECK(cudaStreamCreateWithFlags(&streams[i], cudaStreamNonBlocking));

for(int i=0;i<nm[0];i++) {
    cublasSetStream(cublas_handle, streams[i % stream_n]);
    cublasDsyrk(cublas_handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_T, s1, batch_size,
                &bgemm_alpha[0], map[i], vec_size, &bgemm_beta[0], mcp[i], s1);
}

Thank you for your help.

Use the CUDA profiler to guide you. *SYRK is a rank-k update as I recall, and as such is likely limited by memory throughput, whereas *GEMM is limited by the throughput of the computational units. Converting a compute-limited problem into a bandwidth-limited problem is usually not a good idea.

Thank you for your answer.

I checked the profiler. The whole program spends 99.5 percent of its time in the kernel “maxwell_dgemm_64x64_lower_tn”; this shouldn’t be the case, so there must be something wrong with how we are setting it up.
The timeline is full of events like “cudaSetupArgument”, each consuming several ms.
When I replace cublasDsyrk with cublasDgemm inside the loop I get the same low performance. Something is definitely wrong.

By the way, the MAGMA library has a batched syrk function, so changing from gemm to syrk can’t be such a bad idea.

I did not intend to be misleading; it has been almost ten years since I last dealt with an *SYRK function.

The CUDA profiler is a powerful tool that enables CUDA programmers to zero in on the most important performance bottlenecks rather quickly. It can easily tell you whether a kernel is compute or memory limited. It is worth spending some quality time with to get the hang of it.

I doubt that attempting to do individual syrk operations in cublas and then using streams for parallelism would be anywhere near as efficient as batched operations on small matrices. That’s not been my experience, anyway.

There’s no doubt that a batched syrk might be a good idea. However, doing operations in separate streams is not an equivalent replacement for batching in cublas. If it were, there would be no reason to provide batched versions of operations like gemm. The simple fact that cublas and MAGMA provide batched operations suggests that attempting something similar with streams and non-batched calls probably doesn’t work (well) in many situations.
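To make the distinction concrete, here is a minimal sketch contrasting the two patterns. All variable names (d_Aarray, d_Carray, n_mats, etc.) are my own illustrative assumptions, not the original poster's code; the handle and streams are assumed to exist already.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Sketch only: one batched launch for all matrices versus one launch per matrix.
// d_Aarray/d_Carray are device arrays of device pointers, as cublasDgemmBatched
// expects; d_A[i]/d_C[i] are host-side arrays of individual device pointers.
void batched_vs_streamed(cublasHandle_t handle,
                         const double **d_Aarray, double **d_Carray,
                         double **d_A, double **d_C,
                         int s1, int k, int lda, int n_mats,
                         cudaStream_t *streams, int stream_n)
{
    const double alpha = 1.0, beta = 0.0;

    // Pattern 1: one API call; cuBLAS launches a small number of kernels
    // that together cover ALL n_mats products.
    cublasDgemmBatched(handle, CUBLAS_OP_T, CUBLAS_OP_N, s1, s1, k,
                       &alpha, d_Aarray, lda, d_Aarray, lda,
                       &beta, d_Carray, s1, n_mats);

    // Pattern 2: n_mats API calls and n_mats kernel launches. Each launch
    // carries fixed CPU-side overhead, and each tiny kernel may underutilize
    // the GPU even when launches overlap across streams.
    for (int i = 0; i < n_mats; ++i) {
        cublasSetStream(handle, streams[i % stream_n]);
        cublasDsyrk(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_T, s1, k,
                    &alpha, d_A[i], lda, &beta, d_C[i], s1);
    }
}
```

With many thousands of small outputs, pattern 2 pays the per-call launch overhead thousands of times, which is consistent with the timeline full of cudaSetupArgument entries described above.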

It’s not immediately obvious to me that there is anything wrong with your code/attempt. Since it’s not a complete test case that I can try, that’s as far as I can go. Even if you provided a complete test case, I’m not sure I would have the time to spend analyzing it.

In my experience, this is not a fruitful path.

If you have a specific use case that you think is worthwhile, one possible option is to file an enhancement request (bug report) with NVIDIA. You can describe the batched operation or API that you would like to see; your use case, along with your experiments so far, would lend credence to the proposal.

Batching operations on small matrices is crucially necessary to take advantage of the massive parallelism of GPUs, because small matrices processed individually offer no way of exploiting that parallelism. I can say so with certainty because I was one of the first to research such batched operations when I worked for NVIDIA, and subsequently suggested adding support for batched operations to the CUBLAS team (around 2011, if memory serves).

Thank you for your reply and clarification.

What exactly do you mean by “separate streams are not an equivalent replacement for batching”? If that is the case, then I have a big misunderstanding of how batching works internally.

We followed the NVIDIA example “batchCUBLAS” in the samples directory, in which multiple streams are used to batch several gemm calls. I thought batchedGemm did the same thing internally but freed the user from stream management. If that is not the case, it may explain the huge performance degradation.

I was referring specifically to the use of cublasDgemmBatched to deal with small matrices, both of which you mentioned at the beginning of this thread.

Attempting to call, for example, cublasDgemm in multiple streams, passing a small matrix to each call, is not equivalent performance-wise to a single cublasDgemmBatched call that is passed all of the matrices at once. You should be able to prove this to yourself with a simple test.
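A sketch of such a test might look like the following. All names are hypothetical, the matrices are assumed to be already allocated and filled on the device, and cudaEvent timing is used; it is an illustration of the comparison, not a polished benchmark (error checking omitted).

```cuda
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical micro-benchmark: time one cublasDgemmBatched call against a
// streamed loop of individual cublasDgemm calls on the same n x n matrices.
// d_Aarray/d_Barray/d_Carray are device arrays of device pointers;
// d_A/d_B/d_C are host arrays of device pointers; streams are pre-created.
void compare(cublasHandle_t handle,
             const double **d_Aarray, const double **d_Barray, double **d_Carray,
             double **d_A, double **d_B, double **d_C,
             int n, int n_mats, cudaStream_t *streams, int stream_n)
{
    const double alpha = 1.0, beta = 0.0;
    float ms = 0.0f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // One batched call covering all n_mats products.
    cudaEventRecord(start);
    cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                       &alpha, d_Aarray, n, d_Barray, n,
                       &beta, d_Carray, n, n_mats);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("batched:  %f ms\n", ms);

    // n_mats individual calls spread across streams.
    cudaEventRecord(start);
    for (int i = 0; i < n_mats; ++i) {
        cublasSetStream(handle, streams[i % stream_n]);
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, d_A[i], n, d_B[i], n, &beta, d_C[i], n);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("streamed: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```

Note that the second timing deliberately records the stop event in the default stream after all launches have been issued, so it captures both the CPU-side launch overhead and the GPU execution across streams.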

cublasDgemmBatched does not do the work internally in multiple streams.