GTX 260, CUDA 2.3

I’ve been attempting to use a CUBLAS SGEMM to replace a hand-written kernel that is doing a bunch of weighted sum reductions over a large number of data points.

The data in question, though, is highly rectangular: on the order of [2^6 to 2^10 x 2^16 to 2^20] * [2^16 to 2^20 x 16].
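For reference, the call looks roughly like this. This is a sketch using the legacy CUBLAS API that ships with CUDA 2.3; the specific m and k values are just example points from the ranges above, and error checking is omitted:

```cuda
#include <cublas.h>

int main(void)
{
    int m = 896;      /* small dimension: somewhere in 2^6 .. 2^10  */
    int k = 1 << 18;  /* large dimension: somewhere in 2^16 .. 2^20 */
    int n = 16;

    float *dA, *dB, *dC;
    cublasInit();
    cublasAlloc(m * k, sizeof(float), (void **)&dA);
    cublasAlloc(k * n, sizeof(float), (void **)&dB);
    cublasAlloc(m * n, sizeof(float), (void **)&dC);
    /* ... fill dA and dB (e.g. with cublasSetMatrix) ... */

    /* C (m x n) = A (m x k) * B (k x n), column-major, no transpose */
    cublasSgemm('N', 'N', m, n, k, 1.0f, dA, m, dB, k, 0.0f, dC, m);

    cublasFree(dA);
    cublasFree(dB);
    cublasFree(dC);
    cublasShutdown();
    return 0;
}
```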

Increasing the larger dimension increases the execution time essentially proportionally - which makes sense.

However, execution time is essentially flat as the smaller dimension increases, and then at certain values (around 896 in this example) it makes a large jump and doubles.

Is this normal? I assume it must be a sub-optimal blocking strategy (probably far too coarse, since one dimension is so large relative to the other and it presumably uses square-ish blocks?) leaving the device underutilized - how else could the execution time stay almost constant while one of the dimensions grows? Can anyone with more knowledge of the inner workings of CUBLAS comment?

I’ve read that MAGMA (http://icl.cs.utk.edu/magma/) is supposed to be better optimized for rectangular matrices, but I could only find benchmarks for square matrices - does anyone have experience with it?