CUBLAS SGEMM on highly rectangular matrices

GTX 260, CUDA 2.3

I’ve been attempting to use a CUBLAS SGEMM to replace a hand-written kernel that is doing a bunch of weighted sum reductions over a large number of data points.

The data in question though is highly rectangular, on the order of [2^6 to 2^10 x 2^16 to 2^20] * [2^16 to 2^20 x 16]

Increasing the larger dimension has a essentially directly proportional increase in the execution time - which makes sense.
However the performance is essentially flat as the smaller dimension increases, and then at certain values (around 896 in this example) makes a large jump and the execution time doubles.

Is this normal? I assume it must be a sub-optimal block strategy (probably far too coarse since one dimension is so big relative to the other and it uses square blocks?) causing the device to be underutilized - how else could the execution time remain almost constant as one of the dimensions increases? Anyone with more knowledge on the inner-workings of CUBLAS able to comment?

I’ve read that MAGMA (http://icl.cs.utk.edu/magma/) is supposed to be better optimized for rectangular matrices but I could only find benchmarks with square matrices - does anyone have any experience with that?

Well I played around with the CUDA visual profiler a bit - seems my suspicions were correct.
It uses a very coarse grid of blocks. Up until M=768 or so (just before the first “step”) it is using 24 or less blocks - so not even using all of the SMs. Even at 24 blocks it’s peaking at “only” 80 GFLOPS (assuming 2MN*K flops?) and 15 GB/sec memory throughput (48 blocks doesn’t help since its using 512 threads and has 50% occupacy).