CUBLAS SGEMM on highly rectangular matrices

apangborn · February 18, 2010, 10:57pm

GTX 260, CUDA 2.3

I’ve been attempting to use a CUBLAS SGEMM to replace a hand-written kernel that is doing a bunch of weighted sum reductions over a large number of data points.

The data in question, though, is highly rectangular: on the order of [(2^6 to 2^10) x (2^16 to 2^20)] * [(2^16 to 2^20) x 16].
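To put those shapes in numbers, here is a rough sketch (Python, used purely for the arithmetic) of the work and the minimum memory traffic for C = A * B, with A being M x K and B being K x N. The function name and the sample sizes are illustrative, and the traffic figure assumes each matrix is touched exactly once, which a real kernel will not achieve.

```python
def sgemm_cost(m, n, k, bytes_per_elem=4):
    """Return (flops, min_bytes) for C[m x n] = A[m x k] * B[k x n] in single precision."""
    # One multiply and one add per term of every inner product.
    flops = 2 * m * n * k
    # Idealized traffic: read A and B once, write C once (an assumption, not a measurement).
    min_bytes = bytes_per_elem * (m * k + k * n + m * n)
    return flops, min_bytes

# Sample point inside the ranges above: M = 2^8, K = 2^16, N = 16.
flops, min_bytes = sgemm_cost(m=2**8, n=16, k=2**16)
print(flops, min_bytes, flops / min_bytes)
```

With N fixed at 16, the flops-per-byte ratio stays small no matter how large K gets, so these shapes leave far less room for data reuse than a large square SGEMM.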

Increasing the larger dimension gives an essentially proportional increase in execution time, which makes sense.
However, the execution time stays essentially flat as the smaller dimension increases, and then at certain values (around 896 in this example) it jumps sharply and doubles.

External Media

Is this normal? I assume it must be a sub-optimal block strategy causing the device to be underutilized (probably far too coarse, since one dimension is so big relative to the other and it uses square blocks?). How else could the execution time remain almost constant as one of the dimensions increases? Is anyone with more knowledge of the inner workings of CUBLAS able to comment?

I’ve read that MAGMA (http://icl.cs.utk.edu/magma/) is supposed to be better optimized for rectangular matrices but I could only find benchmarks with square matrices - does anyone have any experience with that?

apangborn · February 20, 2010, 9:27am

Well, I played around with the CUDA Visual Profiler a bit - it seems my suspicions were correct.
It uses a very coarse grid of blocks. Up until M=768 or so (just before the first “step”) it is using 24 or fewer blocks - so not even using all of the SMs. Even at 24 blocks it’s peaking at “only” 80 GFLOPS (assuming 2*M*N*K flops?) and 15 GB/sec memory throughput (48 blocks doesn’t help, since it’s using 512 threads per block and has 50% occupancy).
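A minimal sketch of why a coarse grid would give flat-then-doubling timings, under several assumptions that are not confirmed by the CUBLAS source: a tile height of 32 rows per block (inferred only from the 24 blocks seen at M=768), a single column of tiles because N=16, one resident block per SM (suggested by 512 threads at 50% occupancy), and 24 SMs on the card. The function name is hypothetical.

```python
import math

def blocks_and_waves(m, tile_m, num_sms, blocks_per_sm=1):
    """Return (thread blocks launched, scheduling waves needed) for a 1-D grid over M."""
    blocks = math.ceil(m / tile_m)
    # Blocks beyond what the SMs can hold at once have to wait for a later wave.
    waves = math.ceil(blocks / (num_sms * blocks_per_sm))
    return blocks, waves

# tile_m=32 and num_sms=24 are assumptions; adjust them for the actual kernel and card.
for m in (512, 768, 800, 896, 1024):
    print(m, blocks_and_waves(m, tile_m=32, num_sms=24))
```

In this model the run time follows the number of waves rather than M: it stays flat while all blocks fit in one wave and roughly doubles once a second wave is needed. Where exactly the step lands depends on the assumed tile size and SM count.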
