Hi,
We’ve encountered a performance regression in version 6.0 of the CUDA SDK (compared to 5.5.22), and I managed to track it down to a particular case of calling cublasSgemm.
In that case, according to the profiler (I used nvprof and nvvp), the kernel that actually gets launched is not the same:
- in 5.5.22, it is sgemm_sm35_ldg_tn_128x8x256x16x32, around 6 ms per call,
- in 6.0, it is sgemm_largek_lds64, around 11.3 ms per call.
I demonstrate a small test case here: https://gist.github.com/lamblin/64a1e72a7f97d395d185
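In case the gist goes away, the core of the test is essentially the following (a minimal sketch, not the exact gist code; timing, error checking and host-side setup are omitted, and the device buffers are left uninitialized since only timing matters here):

// Minimal sketch of the problematic call, using the legacy cuBLAS API.
// C (1024x256) = A (16384x1024)^T . B (16384x256), all column-major.
#include <cublas.h>
#include <cuda_runtime.h>

int main() {
    const int m = 1024, n = 256, k = 16384;
    float *A, *B, *C;
    cublasInit();
    cublasAlloc(k * m, sizeof(float), (void **)&A);  // A stored k x m
    cublasAlloc(k * n, sizeof(float), (void **)&B);  // B stored k x n
    cublasAlloc(m * n, sizeof(float), (void **)&C);  // C stored m x n
    for (int i = 0; i < 1000; ++i) {
        // The ('T', 'N') case with this large k is the one that regresses.
        cublasSgemm('T', 'N', m, n, k, 1.0f, A, k, B, k, 0.0f, C, m);
    }
    cudaDeviceSynchronize();
    cublasFree(A); cublasFree(B); cublasFree(C);
    cublasShutdown();
    return 0;
}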
If I compile and run it with CUDA 5.5:
g++ -mtune=corei7 -march=corei7 -O3 -Ofast -Wall -g -I/opt/cuda-5.5.22/include -L/opt/cuda-5.5.22/lib64 -lcublas -lcudart test_cublas_sgemm.cpp -o test_cublas_sgemm_5 && ./test_cublas_sgemm_5
1000 iterations of Sgemm (1024x256) <- (16384x1024)T . (16384x256), real: 6.075054, cpu: 6.040000
With CUDA 6.0:
g++ -mtune=corei7 -march=corei7 -O3 -Ofast -Wall -g -I/opt/cuda-6.0/include -L/opt/cuda-6.0/lib64 -lcublas -lcudart test_cublas_sgemm.cpp -o test_cublas_sgemm_6 && LD_LIBRARY_PATH=/opt/cuda-6.0/lib64:$LD_LIBRARY_PATH ./test_cublas_sgemm_6
1000 iterations of Sgemm (1024x256) <- (16384x1024)T . (16384x256), real: 11.537094, cpu: 11.480000
Using the new API (cublas_v2.h) does not make any significant difference.
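For reference, the v2 version of the same call (reusing m, n, k and the device pointers from the sketch above) is roughly:

// Same operation through cublas_v2.h; alpha/beta are passed by pointer
// and an explicit handle replaces the legacy API's global state.
cublasHandle_t handle;
cublasCreate(&handle);
const float alpha = 1.0f, beta = 0.0f;
cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k,
            &alpha, A, k, B, k, &beta, C, m);
cublasDestroy(handle);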
Setting N=128 instead of 256, or changing the memory layout of A so that it does not need to be transposed, makes the problem go away (presumably because a different kernel gets called); the latter workaround is sketched below.
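For instance, if A is written out already transposed (stored m x k instead of k x m; A_t below is a hypothetical buffer holding that layout), the call becomes a plain ('N', 'N') gemm:

// A_t holds the same data as A but stored m x k column-major,
// so no transpose op is needed and a different kernel is selected.
cublasSgemm('N', 'N', m, n, k, 1.0f, A_t, m, B, k, 0.0f, C, m);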
All tests were done on the same machine:
- Linux (Fedora 19)
- Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
- 8 Tesla K40c (using only 1 at a time for these tests)
- nvidia-smi reports: NVIDIA-SMI 331.62 Driver Version: 331.62
For the moment, our workaround is to continue using version 5.5. Is there another way?