Hi all :)
I need to speed up BLAS matrix multiplication routines such as sgemm, dgemm and zgemm using a GPU. I have used the matrix multiplication implementation in CUBLAS, but the problem with CUBLAS is that the matrices have to be in GPU memory. This limits the usable matrix size (to about 5K x 5K) because GPU memory is limited. Has anyone faced this problem before? Do you know of any implementation which runs on the GPU but takes its input from CPU memory, and hence can handle really big matrix sizes (10K-20K)?

You can decompose large matrices into smaller blocks which fit into GPU memory, compute the submatrix products separately on the GPU, and assemble the final result in host memory as you go along. CUBLAS can be used to do this without any extra device code; you just need a little bit of host code to determine the indices and offsets to copy to GPU memory for the block operations.
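A minimal host-side sketch of that blocking, assuming column-major storage as in BLAS. The names `blocked_gemm` and `tile_gemm` are made up for illustration, and `tile_gemm` is a plain CPU stand-in so the index and offset arithmetic can be checked without a GPU. In a real port, that is the spot where the three tiles would be copied to the device (e.g. with `cublasSetMatrix`), multiplied with the CUBLAS gemm routine, and the result tile copied back (e.g. with `cublasGetMatrix`).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* CPU stand-in for the per-tile device multiply: C += A * B.
   All tiles are column-major and addressed through their leading
   dimensions, exactly as a BLAS-style gemm would receive them. */
static void tile_gemm(int m, int n, int k,
                      const double *A, int lda,
                      const double *B, int ldb,
                      double *C, int ldc)
{
    for (int j = 0; j < n; ++j)
        for (int p = 0; p < k; ++p) {
            double b = B[p + (size_t)j * ldb];
            for (int i = 0; i < m; ++i)
                C[i + (size_t)j * ldc] += A[i + (size_t)p * lda] * b;
        }
}

static int min_int(int a, int b) { return a < b ? a : b; }

/* C (m x n) = A (m x k) * B (k x n), all in host memory, processed in
   tiles of at most nb x nb so that only three tiles are live at once. */
void blocked_gemm(int m, int n, int k, int nb,
                  const double *A, int lda,
                  const double *B, int ldb,
                  double *C, int ldc)
{
    assert(nb > 0 && lda >= m && ldb >= k && ldc >= m);

    for (int j0 = 0; j0 < n; j0 += nb) {
        int nj = min_int(nb, n - j0);          /* columns in this C tile */
        for (int i0 = 0; i0 < m; i0 += nb) {
            int mi = min_int(nb, m - i0);      /* rows in this C tile */

            /* Start the C tile at zero, one column at a time. */
            for (int j = 0; j < nj; ++j)
                memset(&C[i0 + (size_t)(j0 + j) * ldc], 0,
                       (size_t)mi * sizeof(double));

            /* Accumulate the products of the matching A and B tiles. */
            for (int p0 = 0; p0 < k; p0 += nb) {
                int kp = min_int(nb, k - p0);
                tile_gemm(mi, nj, kp,
                          &A[i0 + (size_t)p0 * lda], lda,
                          &B[p0 + (size_t)j0 * ldb], ldb,
                          &C[i0 + (size_t)j0 * ldc], ldc);
            }
        }
    }
}
```

The C tile stays put while the loop over `p0` runs, so on a GPU it would only need to be downloaded once per (i0, j0) pair; only the A and B tiles have to be re-uploaded for each step.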

If you can’t store matrices of more than 5K x 5K elements, I guess you have a card with less than 256MB, so it’s probably not one of the latest models. I’ve been doing some tests with a Tesla C1060 and the cuBLAS functions, and the performance of dgemm is only about 2x better (in the best cases) than that of the dgemm function from the Intel MKL library. So, considering that you don’t have a top-end GPU, you might get better results using the Intel compilers with the MKL libraries and running on the CPU.

Other cuBLAS functions, like the matrix-vector ones, do perform considerably better than their MKL counterparts, especially if you don’t need to make the CPU-GPU-CPU round trip every time.

For single precision gemm, a 5k x 5k call would require 300MB of space for the three matrices, and 600MB for double precision, so I would guess the original poster has either a 512MB or an 896MB card. On a host with plenty of memory I have successfully computed a 30k x 30k double precision gemm using an 896MB GTX275 at about 80 Gflop/s. So big problems can be done on small cards with a little lateral thinking.
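The arithmetic behind those figures, as a small sketch. The helper names `gemm_bytes` and `max_tile` are hypothetical, and the estimate assumes square matrices with exactly three operands resident at once. A real card will have somewhat less than its nominal memory available, so treat the tile size as an upper bound.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed to hold the three n x n operands (A, B and C) of one gemm
   call; elem_size is 4 for single precision and 8 for double precision. */
uint64_t gemm_bytes(uint64_t n, uint64_t elem_size)
{
    return 3u * n * n * elem_size;
}

/* Largest square tile size nb such that three nb x nb tiles still fit
   into `budget` bytes of device memory. */
uint64_t max_tile(uint64_t budget, uint64_t elem_size)
{
    assert(elem_size > 0);   /* otherwise the loop below never ends */
    uint64_t nb = 0;
    while (gemm_bytes(nb + 1, elem_size) <= budget)
        ++nb;
    return nb;
}
```

`gemm_bytes(5000, 4)` gives 300,000,000 bytes and `gemm_bytes(5000, 8)` gives 600,000,000 bytes, matching the 300MB and 600MB above. The 30k x 30k double precision case comes to 21,600,000,000 bytes, which is why it has to live in host memory and be fed to the card in blocks.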

Thanks for the link. I have a little C library for doing a multi-GPU version of what I am guessing is pretty much the same thing, in my much talked about but still unpublished linpack port. It might be interesting to compare notes…

Yep, my fault, I only considered one matrix instead of the three of them. In that case I agree with the recommendation to decompose the original matrices into blocks.