What's the best matrix size for cublasSgemm performance?

gu_xiangtao · February 16, 2017, 8:41am

I’m working on DNN optimization, and most of the work is matrix multiplication.
I tested square matrices of different sizes with cublasSgemm().
I tested on a GTX 1080 board with CUDA 8.0 and found that performance varies with the matrix size N.
When N < 512, 1000 multiplications of N x N matrices take 0-3 ms in total, and the time increases smoothly with N.
But at N = 513 the time jumps to 80 ms. After that, roughly every further increase of 100 in N brings the time to a new, higher level.
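
To compare sizes on a common scale, timings like those above can be converted to GFLOPS. This is a minimal sketch in plain C++ (no GPU needed), assuming the usual count of 2·N³ floating-point operations for one square SGEMM; the helper name sgemmGflops is hypothetical.

```cpp
#include <cassert>

// Hypothetical helper: throughput in GFLOPS for `reps` square SGEMMs of size n
// that together took `seconds`. Assumes 2*n^3 floating-point operations each.
double sgemmGflops(int n, int reps, double seconds) {
    assert(n > 0 && reps > 0 && seconds > 0.0);
    // One multiply and one add per term of each inner product.
    const double flopPerGemm = 2.0 * n * n * n;
    return flopPerGemm * reps / seconds / 1e9;
}
```

For example, sgemmGflops(512, 1000, 0.003) and sgemmGflops(513, 1000, 0.080) convert the two timings quoted above. The first works out to tens of TFLOPS, well beyond the single-precision peak of a GTX 1080 (roughly 9 TFLOPS by public specifications, stated here as an assumption), which would suggest the timer stopped before the GPU had finished. cuBLAS calls return to the host asynchronously, so a device synchronization before reading the timer is normally needed.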

1. What determines the critical matrix size for cublasSgemm() performance? Device memory or compute unit resources?

2. How should I tune the matrix size for best performance?

3. When using multiple streams, how should I tune the matrix size so that SGEMM calls in different streams run concurrently on one GPU?

Please excuse my poor English.

Thanks.

Here is my test result table:

njuffa · February 16, 2017, 5:41pm

A typical graph plotting performance vs size for any BLAS library shows a sawtooth pattern, due to various granularity effects in the tiling (blocking) used. In general SGEMM is a compute-bound task.

I am reasonably sure that NVIDIA gives guidance on (S)GEMM performance somewhere in the documentation. From memory, for best performance (as measured in GFLOPS) the matrices should be

(1) square
(2) dimensions multiple of 32
(3) large (but performance plateaus for dimension > 4K x 4K)
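
Guideline (2) can be met for awkward sizes by padding the matrices with zeros up to the next multiple of 32, at the cost of some extra memory and arithmetic. A minimal sketch of the size calculation in plain C++ follows; the helper name roundUpToMultiple is hypothetical, and whether padding actually pays off would need to be measured.

```cpp
#include <cassert>

// Hypothetical helper: smallest multiple of `multiple` that is >= n.
int roundUpToMultiple(int n, int multiple) {
    assert(n >= 0 && multiple > 0);
    return ((n + multiple - 1) / multiple) * multiple;
}
```

With this helper, N = 513 would be padded to 544, while N = 512 stays unchanged. Because the padding is zero, the top-left N x N block of the padded product equals the original product.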

There is some performance impact from transposition mode, but typically < 10%. If you have a choice there, worth experimenting. I seem to recall that transpose-B, notranspose-A is often the fastest, but my memory is hazy.
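
For readers unfamiliar with the transposition modes: they select whether each operand is used as stored or transposed, so "transpose-B, notranspose-A" computes C = A·Bᵀ. The CPU reference below, in plain C++, illustrates only those semantics, assuming cuBLAS-style column-major storage. The function name refGemm is hypothetical, and this is not how cuBLAS is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Element (row, col) of a column-major matrix with leading dimension ld.
static float at(const std::vector<float>& m, int ld, int row, int col) {
    return m[static_cast<std::size_t>(row) + static_cast<std::size_t>(col) * ld];
}

// Hypothetical CPU reference: C = op(A) * op(B), where op(X) is X or X^T.
// op(A) is m x k, op(B) is k x n, and the returned C is m x n (column-major).
std::vector<float> refGemm(bool transA, bool transB, int m, int n, int k,
                           const std::vector<float>& A, int lda,
                           const std::vector<float>& B, int ldb) {
    assert(m > 0 && n > 0 && k > 0 && lda > 0 && ldb > 0);
    std::vector<float> C(static_cast<std::size_t>(m) * n, 0.0f);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                // A transposed operand is read with row and column swapped.
                const float a = transA ? at(A, lda, p, i) : at(A, lda, i, p);
                const float b = transB ? at(B, ldb, j, p) : at(B, ldb, p, j);
                acc += a * b;
            }
            C[static_cast<std::size_t>(i) + static_cast<std::size_t>(j) * m] = acc;
        }
    }
    return C;
}
```

Calling refGemm(false, true, ...) corresponds to the notranspose-A, transpose-B case mentioned above. The mode changes only how the stored data is indexed, which is why it affects memory access patterns rather than the amount of arithmetic.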
