CUBLAS matrix multiplication matrix size limited by GPU memory size

sac · July 14, 2010, 5:13am

Hi All, :)
I need to speed up BLAS matrix multiplication functions such as sgemm, dgemm and zgemm using a GPU. I have used the matrix multiplication implementation available in CUBLAS, but the problem with CUBLAS is that the matrices have to be in GPU memory. This limits the usable matrix size (about 5K) because GPU memory is limited. Has anyone faced this problem before? Do you know of any implementation which runs on the GPU but takes its input from CPU memory, and hence can run with really big matrix sizes (10-20K)?

Best Regards,
Sachin

avidday · July 14, 2010, 6:39am

You can decompose large matrices into smaller blocks which fit into GPU memory, compute the submatrix products separately on the GPU, and assemble the final result in host memory as you go along. CUBLAS can be used to do this without any extra device code; you just need a little bit of host code to determine the indices and offsets to copy to GPU memory for the block operations.
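To make the index and offset bookkeeping concrete, here is a minimal host-side sketch in plain Python. It is not real CUBLAS code: `gemm_tile` is a hypothetical stand-in for one on-device gemm call (cublasSgemm/cublasDgemm in a real program), and `slice_tile` stands in for the host-to-device copy of one block (something like cublasSetMatrix). The function names and the `tile` parameter are my own assumptions for illustration.

```python
def gemm_tile(a, b, c):
    """Stand-in for one on-device GEMM call: accumulates c += a * b."""
    for i in range(len(a)):
        for j in range(len(b[0])):
            s = 0.0
            for p in range(len(b)):
                s += a[i][p] * b[p][j]
            c[i][j] += s


def slice_tile(m, r0, r1, c0, c1):
    """Copy rows r0..r1 and columns c0..c1 of m (the 'upload' of one block)."""
    return [row[c0:c1] for row in m[r0:r1]]


def blocked_gemm(A, B, tile):
    """Compute C = A * B one tile at a time, assembling C in host memory."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        i1 = min(i0 + tile, m)  # edge tiles may be smaller than `tile`
        for j0 in range(0, n, tile):
            j1 = min(j0 + tile, n)
            # One output block stays resident while we sweep the inner dimension.
            c_tile = [[0.0] * (j1 - j0) for _ in range(i1 - i0)]
            for p0 in range(0, k, tile):
                p1 = min(p0 + tile, k)
                a_tile = slice_tile(A, i0, i1, p0, p1)
                b_tile = slice_tile(B, p0, p1, j0, j1)
                gemm_tile(a_tile, b_tile, c_tile)
            # 'Download' the finished block into its place in the host result.
            for di, row in enumerate(c_tile):
                C[i0 + di][j0:j1] = row
    return C
```

Only three tiles (one each of A, B and C) need to be resident at any time, so the tile size can be chosen to fit the card rather than the problem.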

marcuse · July 14, 2010, 12:51pm

If you can’t store matrices of more than 5K x 5K elements, I guess you have a card with less than 256MB, so it’s probably not one of the latest models. I’ve been doing some tests with a Tesla C1060 and the cuBLAS functions, and the performance of dgemm is only about 2x better (in the best cases) than that of the dgemm function from the Intel MKL library. So, considering that you don’t have a top GPU, you might get better results using the Intel compilers with the MKL libraries and running the tests on the CPU.

Other cuBLAS functions, like the matrix-vector ones, do perform considerably better than their MKL counterparts, especially if you don’t need to make the CPU-GPU-CPU transfer every time.

avidday · July 14, 2010, 1:44pm

For single precision gemm, a 5k x 5k call would require 300MB of space for the three matrices, and 600MB for double precision, so I would guess the original poster has either a 512MB or an 896MB card. On a host with plenty of memory I have successfully computed a 30k x 30k double precision gemm using an 896MB GTX275 at about 80 Gflop/s. So big problems can be done on small cards with a little lateral thinking.
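The arithmetic behind those figures is easy to check, and the same formula gives a rough upper bound on the tile size a card can hold. This is only a sketch: the helper names are made up for illustration, sizes are in decimal megabytes, and it ignores everything else that occupies GPU memory (context, display, workspace), so a real tile would have to be somewhat smaller.

```python
import math


def gemm_footprint_bytes(n, elem_size):
    """Bytes needed to hold A, B and C for an n x n gemm."""
    return 3 * n * n * elem_size


def max_square_tile(budget_bytes, elem_size):
    """Largest t such that three t x t tiles fit in budget_bytes."""
    return math.isqrt(budget_bytes // (3 * elem_size))


# Single (4-byte) and double (8-byte) precision at 5k x 5k, in MB.
print(gemm_footprint_bytes(5000, 4) // 10**6)  # 300
print(gemm_footprint_bytes(5000, 8) // 10**6)  # 600

# Rough tile bound for an 896MB card in double precision.
print(max_square_tile(896 * 10**6, 8))
```

By the same formula the full 30k x 30k double precision problem needs about 21.6GB for its three matrices, which is why they have to live in host memory and be streamed through the card block by block.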

YDD · July 14, 2010, 2:07pm

I offer SciGPU-GEMM, a little library for doing exactly this.

avidday · July 14, 2010, 3:28pm

Thanks for the link. I have a little C library that does a multi-GPU version of what I am guessing is pretty much the same thing, in my much talked about but still unpublished linpack port. It might be interesting to compare notes…

marcuse · July 15, 2010, 7:00am

Yep, my fault, I only considered one matrix instead of three. In that case I agree with the recommendation to decompose the original matrix into blocks.

sac · August 2, 2010, 6:20am

Thanks. I used SciGPU-GEMM to develop a GPU BLAS library (with just matrix multiplication).
