Speed regression for a pattern of sgemm in cuBLAS6


We’ve encountered a speed regression in version 6.0 of the SDK (compared to 5.5.22), and I managed to track it down to a particular case of calling cublasSgemm.
In that case, the profiler (I used nvprof and nvvp) shows that a different kernel is launched:

  • in 5.5.22, it is sgemm_sm35_ldg_tn_128x8x256x16x32, around 6 ms per call,
  • in 6.0, it is sgemm_largek_lds64, around 11.3 ms per call.

A small test case that reproduces it is here: https://gist.github.com/lamblin/64a1e72a7f97d395d185
If I compile and run it with CUDA 5.5:

g++ -mtune=corei7 -march=corei7 -O3 -Ofast -Wall -g  -I/opt/cuda-5.5.22/include -L/opt/cuda-5.5.22/lib64 -lcublas -lcudart test_cublas_sgemm.cpp -o test_cublas_sgemm_5 && ./test_cublas_sgemm_5
1000 iterations of Sgemm (1024x256) <- (16384x1024)T . (16384x256), real: 6.075054, cpu: 6.040000

With CUDA 6.0:

g++ -mtune=corei7 -march=corei7 -O3 -Ofast -Wall -g  -I/opt/cuda-6.0/include -L/opt/cuda-6.0/lib64 -lcublas -lcudart test_cublas_sgemm.cpp -o test_cublas_sgemm_6 && LD_LIBRARY_PATH=/opt/cuda-6.0/lib64:$LD_LIBRARY_PATH ./test_cublas_sgemm_6
1000 iterations of Sgemm (1024x256) <- (16384x1024)T . (16384x256), real: 11.537094, cpu: 11.480000
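
To put the two timings in perspective, they can be converted to sustained throughput. The sketch below is only an illustration: the helper names gemm_flops and gemm_gflops are hypothetical (they are not part of the gist or of cuBLAS), and it assumes that the "real" figure printed by the test is the total wall time in seconds for all 1000 iterations, which is consistent with the per-call times reported by the profiler.

```cpp
#include <cassert>

// Floating-point operations for one GEMM call C(m x n) = op(A)(m x k) * op(B)(k x n):
// one multiply and one add for every (i, j, l) triple.
double gemm_flops(double m, double n, double k) {
    return 2.0 * m * n * k;
}

// Sustained throughput in GFLOP/s, given the total wall time (in seconds)
// spent on `iterations` identical GEMM calls.
// Assumption: total_seconds covers all iterations, not a single call.
double gemm_gflops(double m, double n, double k,
                   int iterations, double total_seconds) {
    assert(iterations > 0 && total_seconds > 0.0);
    return gemm_flops(m, n, k) * iterations / total_seconds / 1e9;
}
```

Under that assumption, gemm_gflops(1024, 256, 16384, 1000, 6.075054) comes out at roughly 1.41 TFLOP/s for CUDA 5.5, and gemm_gflops(1024, 256, 16384, 1000, 11.537094) at roughly 0.74 TFLOP/s for CUDA 6.0, a slowdown of about 1.9x for the same problem size.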

Using the new API (cublas_v2.h) does not make any significant difference.

Setting N=128 instead of 256, or changing the memory layout of A so that it does not need to be transposed, makes the problem go away (presumably because a different kernel gets called).
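
For readers unfamiliar with the two layouts, here is a CPU reference sketch of what "transposed A" versus "pre-transposed A" means in the column-major convention that cuBLAS uses. It is not the code from the gist: gemm_tn, gemm_nn and transpose_colmajor are hypothetical helper names, and the sketch only shows that both layouts produce the same C, so the layout change affects which GPU kernel is picked and not the result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// C (m x n) = A^T * B, with A stored column-major as k x m (leading dimension k)
// and B stored column-major as k x n. This corresponds to passing CUBLAS_OP_T for A.
std::vector<float> gemm_tn(const std::vector<float>& A, const std::vector<float>& B,
                           int m, int n, int k) {
    assert(A.size() == std::size_t(k) * m && B.size() == std::size_t(k) * n);
    std::vector<float> C(std::size_t(m) * n, 0.0f);
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int l = 0; l < k; ++l)
                acc += A[l + std::size_t(i) * k] * B[l + std::size_t(j) * k];
            C[i + std::size_t(j) * m] = acc;
        }
    return C;
}

// Same product, but A is already stored as m x k column-major (leading dimension m),
// so no transpose is requested (the CUBLAS_OP_N case).
std::vector<float> gemm_nn(const std::vector<float>& A, const std::vector<float>& B,
                           int m, int n, int k) {
    assert(A.size() == std::size_t(m) * k && B.size() == std::size_t(k) * n);
    std::vector<float> C(std::size_t(m) * n, 0.0f);
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int l = 0; l < k; ++l)
                acc += A[i + std::size_t(l) * m] * B[l + std::size_t(j) * k];
            C[i + std::size_t(j) * m] = acc;
        }
    return C;
}

// Re-lay out a rows x cols column-major matrix as its cols x rows transpose.
std::vector<float> transpose_colmajor(const std::vector<float>& A, int rows, int cols) {
    assert(A.size() == std::size_t(rows) * cols);
    std::vector<float> T(A.size());
    for (int c = 0; c < cols; ++c)
        for (int r = 0; r < rows; ++r)
            T[c + std::size_t(r) * cols] = A[r + std::size_t(c) * rows];
    return T;
}
```

In the case reported here, m=1024, n=256 and k=16384, so the first variant reads A as a 16384x1024 array and the second reads the re-laid-out 1024x16384 array; gemm_nn(transpose_colmajor(A, k, m), B, m, n, k) returns the same values as gemm_tn(A, B, m, n, k).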

All tests were done on the same machine:

  • Linux (FC 19)
  • Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
  • 8 Tesla K40c (using only 1 at a time for these tests)
  • nvidia-smi reports: NVIDIA-SMI 331.62 Driver Version: 331.62

For the moment, our workaround is to continue using version 5.5. Is there another way?

Since you already have a ready-to-use repro case in hand, I would suggest filing a bug report through the form linked from the registered developer website. According to your observations, it seems that the heuristic that selects which SGEMM kernel to run has changed between CUDA 5.5 and CUDA 6.0 and may need retuning. As your further experiments demonstrate, such a heuristic is a multi-dimensional problem dependent on architecture, dimensions, and transpose modes, and while it is unlikely that any heuristic will ever be optimal for all possible combinations, it is certainly useful to take another look. Thank you for your help.

In practical terms, I see no reason not to stick with CUDA 5.5 for now, unless you have definite needs for one of the new CUDA 6.0 features, such as unified memory. You may want to try the CUDA 6.5 RC when it becomes available.

Thanks for your reply.
I just applied for the registered developer program, so I can submit a proper bug report.

The main drawback of staying on CUDA 5.5 is that I don’t think it fully supports gcc 4.8, which is the default on FC 19.

Approval for the registered developer program typically takes one business day, but since today is a holiday in the US, the next business day is Tuesday 5/27. Thanks for filing the bug.

According to the CUDA 5.5 Release Notes, it supports FC16 with gcc 4.6.2 (other versions may work but this cannot be guaranteed).


CUDA 6.0, on the other hand, supports FC19 with gcc 4.8.1.


The bug is now filed at https://developer.nvidia.com/nvbugs/cuda/edit/1518004
Thanks for the help.