I am using a GeForce GTX 260 with CUDA 3.1.

I have a problem multiplying two single-precision real matrices with the CUBLAS function cublasSgemm. The dimensions of the matrices are a little unusual: 3x4 and 4x8120601.

The performance I get is very low: only 0.4 GFlop/s. For comparison, a test multiplication of 1024x1024 matrices performs well, at 320 GFlop/s.
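To put these numbers in perspective, here is a rough count of operations and memory traffic for a product of this shape. It is only a back-of-the-envelope sketch: the helper names `gemm_flops`, `gemm_bytes` and `gemm_intensity` are my own, and it assumes each matrix is read or written exactly once in single precision.

```c
#include <stddef.h>

/* Floating-point operations for C (m x n) = A (m x k) * B (k x n):
   one multiply and one add per inner-product term. */
static double gemm_flops(size_t m, size_t k, size_t n)
{
    return 2.0 * (double)m * (double)k * (double)n;
}

/* Minimum memory traffic in bytes, assuming A and B are each read once
   and C is written once, all as 4-byte floats (an idealisation). */
static double gemm_bytes(size_t m, size_t k, size_t n)
{
    double elems = (double)m * (double)k
                 + (double)k * (double)n
                 + (double)m * (double)n;
    return 4.0 * elems;
}

/* Arithmetic intensity in flops per byte of traffic. */
static double gemm_intensity(size_t m, size_t k, size_t n)
{
    return gemm_flops(m, k, n) / gemm_bytes(m, k, n);
}
```

For m = 3, k = 4, n = 8120601 this gives 194,894,424 floating-point operations against 227,376,876 bytes of traffic, which is less than one flop per byte. So this product probably behaves more like a memory copy than like the 1024x1024 test, where every loaded element is reused many times.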

How can I speed up the calculation? Could I use some kind of block decomposition of the matrices? Or is there something else that can accelerate the CUBLAS functions?
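To make the block idea concrete, this is the kind of decomposition I have in mind, written as a plain C reference on the CPU. It is only a sketch under my own assumptions: the names `gemm_ref` and `gemm_col_blocked` are hypothetical, and the matrices are stored column-major (as CUBLAS expects) with leading dimensions equal to their row counts. The wide matrix is split into column blocks and each block is multiplied independently; on the GPU each block would presumably become its own cublasSgemm call on offset pointers. I do not know whether this actually helps.

```c
#include <stddef.h>

/* Reference product C (m x n) = A (m x k) * B (k x n).
   All matrices are column-major with leading dimension = row count. */
static void gemm_ref(int m, int k, size_t n,
                     const float *A, const float *B, float *C)
{
    for (size_t j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += A[i + (size_t)p * m] * B[p + j * k];
            C[i + j * m] = acc;
        }
    }
}

/* Same product, computed one column block at a time:
   C[:, j0 .. j0+nb) = A * B[:, j0 .. j0+nb).
   In column-major storage a column block is contiguous, so it is
   addressed with a pointer offset only. block must be at least 1. */
static void gemm_col_blocked(int m, int k, size_t n, size_t block,
                             const float *A, const float *B, float *C)
{
    for (size_t j0 = 0; j0 < n; j0 += block) {
        size_t nb = (n - j0 < block) ? (n - j0) : block;
        gemm_ref(m, k, nb, A, B + j0 * k, C + j0 * m);
    }
}
```

Here `gemm_col_blocked(3, 4, 8120601, block, A, B, C)` would give the same result as a single `gemm_ref` call for any block size of at least 1, since the columns of the result do not depend on each other.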