Using CUDA Streams and cublasDsyrk to replace a cublasDgemmBatched call results in very low performance

Hi,

We have to multiply many small matrices with themselves. Currently we are using cublasDgemmBatched, which gives us reasonable performance.
However, because each matrix is multiplied with itself (so the result is symmetric), we want to improve performance by using cublasDsyrk.

We tried to write our own batchedDsyrk using CUDA streams and the cublasDsyrk function, but we observed extremely low performance with this approach.
With the cublasDgemmBatched version the program takes several seconds to finish; with our custom batchedDsyrk it suddenly needs several minutes.
We think we are doing something wrong, but we are not able to find our mistake. The input matrices have a size of 32x16 (the output matrices are square, of size s1 x s1). We are using a P100 GPU.

The cublasDgemmBatched call looks like this:

cublasDgemmBatched(cublas_handle, CUBLAS_OP_T, CUBLAS_OP_N, s1, s1, batch_size,
                   &bgemm_alpha[0], (const double**) d_map, vec_size,
                                    (const double**) d_map, vec_size,
                   &bgemm_beta[0],                   d_mcp, s1, nm[0]);
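(For each of the nm[0] matrices this computes C_i = alpha * A_i^T * A_i + beta * C_i, i.e. the full s1 x s1 product is formed even though the result is symmetric; that is why cublasDsyrk looked attractive.)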

Our custom batchedDsyrk looks like this (the streams are created only once, before the main loop, and then reused several times):

// Create a pool of non-blocking streams once, before the main loop.
const int stream_n = 128;
cudaStream_t *streams = (cudaStream_t*)malloc(stream_n * sizeof(cudaStream_t));
for(int i=0;i<stream_n;i++)
{
    CHECK(cudaStreamCreateWithFlags(&streams[i], cudaStreamNonBlocking));
}

for(int i = 0; i < nm[0]; i++)
{
    // Round-robin the individual rank-k updates over the stream pool.
    cublasSetStream(cublas_handle, streams[i % stream_n]);
    cublasDsyrk(cublas_handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_T, s1, batch_size,
                &bgemm_alpha[0], map[i], vec_size, &bgemm_beta[0], mcp[i], s1);
}
// Wait for all streams to finish, then restore the default stream.
cudaDeviceSynchronize();
cublasSetStream(cublas_handle, NULL);
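For completeness, the streams are destroyed again once the program is done, roughly like this (a sketch; CHECK is our error-checking macro from above):

for(int i = 0; i < stream_n; i++)
{
    CHECK(cudaStreamDestroy(streams[i]));
}
free(streams);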

Thank you for your help.

Use the CUDA profiler to guide you. *SYRK is a rank-k update as I recall, and as such likely limited by memory throughput, whereas *GEMM is limited by the throughput of the computational units. Converting a compute-limited problem into a bandwidth-limited one is usually not a good idea.

Thank you for your answer.

I checked the profiler. The whole program spends 99.5 percent of its time in the kernel “maxwell_dgemm_64x64_lower_tn”; this shouldn’t be the case, so there must be something wrong with how we set it up.
The timeline is full of events like “cudaSetupArgument”, each consuming several milliseconds.
When I replace cublasDsyrk with cublasDgemm inside the loop I get the same low performance, so something is definitely wrong.
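If it helps, the per-call API overhead can be summarized with the command-line profiler, along these lines (an illustrative invocation, not our exact command):

nvprof --print-api-summary ./our_app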

By the way, the MAGMA library has a batched syrk function, so changing from gemm to syrk can't be such a bad idea.

I did not intend to be misleading; it has been almost ten years since I last dealt with an *SYRK function.

The CUDA profiler is a powerful tool that enables CUDA programmers to zero in on the most important performance bottlenecks rather quickly. It can easily tell you whether a kernel is compute or memory limited. It is worth spending some quality time with to get the hang of it.
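For example, something along these lines (assuming the nvprof command-line profiler that ships with CUDA; metric names vary between toolkit versions):

nvprof --print-gpu-summary ./app
nvprof --metrics flop_dp_efficiency,dram_utilization ./app

The first shows where the GPU time goes; the second reports, per kernel, the achieved double-precision FLOP efficiency versus DRAM utilization, which indicates whether a kernel is compute- or bandwidth-limited.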

I doubt that attempting to do individual syrk operations in cublas and then using streams for parallelism would be anywhere near as efficient as batched operations on small matrices. That’s not been my experience, anyway.

There’s no doubt that a batched syrk could be a good idea. However, doing operations in separate streams is not an equivalent replacement for batching in cublas. If it were, there would be no reason to provide batched versions of operations like gemm. The simple fact that cublas and magma provide batched operations suggests that attempting to do something similar with streams and non-batched calls probably doesn’t work (well) in many situations.

It’s not immediately obvious to me that there is anything wrong with your code/attempt. Since it’s not a complete test case that I can try, that’s as far as I can go. Even if you provided a complete test case, I’m not sure I would have the time to spend analyzing it.

In my experience, this is not a fruitful path.

If you have a specific use case that you think is worthwhile, one possible option is to file an enhancement request (bug report) at developer.nvidia.com. You can define the batched operation or API that you would like to see, and your use case along with your experiments so far would lend credence to the proposal.

Batching operations on small matrices is crucial for exploiting the massive parallelism of GPUs, because small matrices processed individually offer no way of taking advantage of that parallelism. I can say so with certainty because I was one of the first to research such batched operations when I worked for NVIDIA, and I subsequently suggested adding support for batched operations to the CUBLAS team (around 2011, if memory serves).

Thank you for your reply and the clarification.

What exactly do you mean by "separate streams are not an equivalent replacement for batching"? If that is the case, then I have a big misunderstanding of how batching works internally.

We followed the NVIDIA example "batchCUBLAS" in the samples directory. In that example, multiple streams are used to batch several gemm calls, so I thought batchedGemm does the same thing internally but frees the user from stream management. If that is not the case, it may explain the huge performance degradation.

I was referring specifically to the use of cublasDgemmBatched to deal with small matrices, both of which you mentioned at the beginning of your posting.

Attempting to use, for example, cublasDgemm and calling that function in multiple streams, while passing a small matrix to each cublasDgemm call, is not equivalent, performance-wise, to calling a single instance of cublasDgemmBatched and passing all of the matrices to that one call. You should be able to prove this to yourself with a simple test.
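A minimal sketch of such a test might look like this (the sizes, batch count, and names here are my assumptions, not taken from your code; error checking and cleanup are omitted for brevity):

#include <cstdio>
#include <cstdlib>
#include <chrono>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main()
{
    const int n = 16, k = 32;      // small matrices, roughly as in this thread
    const int batch = 4096;        // assumed batch count
    const int stream_n = 32;       // assumed stream pool size
    const double alpha = 1.0, beta = 0.0;

    cublasHandle_t handle;
    cublasCreate(&handle);

    // One contiguous allocation per operand; d_Aptr/d_Cptr hold the
    // per-matrix device pointers that cublasDgemmBatched expects.
    double *d_A, *d_C;
    cudaMalloc(&d_A, (size_t)batch * k * n * sizeof(double));
    cudaMalloc(&d_C, (size_t)batch * n * n * sizeof(double));
    cudaMemset(d_A, 0, (size_t)batch * k * n * sizeof(double));

    double **h_Aptr = (double**)malloc(batch * sizeof(double*));
    double **h_Cptr = (double**)malloc(batch * sizeof(double*));
    for (int i = 0; i < batch; i++) {
        h_Aptr[i] = d_A + (size_t)i * k * n;
        h_Cptr[i] = d_C + (size_t)i * n * n;
    }
    double **d_Aptr, **d_Cptr;
    cudaMalloc(&d_Aptr, batch * sizeof(double*));
    cudaMalloc(&d_Cptr, batch * sizeof(double*));
    cudaMemcpy(d_Aptr, h_Aptr, batch * sizeof(double*), cudaMemcpyHostToDevice);
    cudaMemcpy(d_Cptr, h_Cptr, batch * sizeof(double*), cudaMemcpyHostToDevice);

    cudaStream_t streams[stream_n];
    for (int i = 0; i < stream_n; i++)
        cudaStreamCreateWithFlags(&streams[i], cudaStreamNonBlocking);

    // Variant 1: one batched call computing C_i = A_i^T * A_i for all i.
    cudaDeviceSynchronize();
    auto t0 = std::chrono::steady_clock::now();
    cublasDgemmBatched(handle, CUBLAS_OP_T, CUBLAS_OP_N, n, n, k,
                       &alpha, (const double**)d_Aptr, k,
                               (const double**)d_Aptr, k,
                       &beta, d_Cptr, n, batch);
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();
    printf("cublasDgemmBatched: %.3f ms\n",
           std::chrono::duration<double, std::milli>(t1 - t0).count());

    // Variant 2: the same work issued as individual gemm calls,
    // round-robined over the stream pool.
    auto t2 = std::chrono::steady_clock::now();
    for (int i = 0; i < batch; i++) {
        cublasSetStream(handle, streams[i % stream_n]);
        cublasDgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N, n, n, k,
                    &alpha, h_Aptr[i], k, h_Aptr[i], k,
                    &beta, h_Cptr[i], n);
    }
    cudaDeviceSynchronize();
    auto t3 = std::chrono::steady_clock::now();
    printf("looped cublasDgemm: %.3f ms\n",
           std::chrono::duration<double, std::milli>(t3 - t2).count());

    return 0;
}

For matrices this small I would expect the batched call to win easily: each individual gemm is far too little work to fill the GPU, and the launch overhead of thousands of separate calls dominates.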

cublasDgemmBatched is not internally doing the work in multiple streams.