Query on matrix multiply performance when the matrix is very large

Hi all,

I am facing a situation where the matrix is bigger than the device memory (12 GB device; I am using a Tesla K20) when doing a matrix multiply operation.

Has anybody had a similar experience before?
How should I deal with this situation? To be more straightforward, I think the huge matrix needs to be split before the memory copy from host to device. Are there any samples I can reference?

I am a newbie to CUDA, looking forward to any comments and feedback.

Thanks.
Allen Zhang

indeed possible
it involves storing the matrices on the host, and only using sub-matrices on the device
the host should lead the process - it controls and tracks the sub-matrices, and puts them in place
in general terms, the host keeps the grand matrix in mind; the device is generally oblivious to it
streams may aid the process

a matrix multiplication is essentially a per-row operation: each row of the one matrix is multiplied (dot product) against the rows of the other, transposed, matrix
if you follow how a host-device combination can multiply matrices on a row basis, it should not be difficult to comprehend how this can be extended to sub-matrix tiles

the very high-level pseudocode should look like this (a rough code sketch follows the list):

for each block of rows:
extract a number of rows from the input matrix (matrices) and copy them to the device [host]
perform the desired row operations on the rows [device]
copy the rows back to (an output matrix on) the host, in the appropriate location [host]
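
a minimal CUDA sketch of that loop, assuming single precision, row-major storage, and that B (k x n) still fits on the device while A (m x k) and C (m x n) do not; the kernel and the chunk size are purely illustrative, and error checking is omitted:

[code]
#include <cuda_runtime.h>

// naive kernel: multiply one block of rows of A by the full B
__global__ void rowBlockGemm(const float *A, const float *B, float *C,
                             int rows, int k, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // row within this chunk
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // column of C
    if (row < rows && col < n) {
        float acc = 0.0f;
        for (int i = 0; i < k; ++i)
            acc += A[(size_t)row * k + i] * B[(size_t)i * n + col];
        C[(size_t)row * n + col] = acc;
    }
}

void hugeGemm(const float *hA, const float *hB, float *hC,
              int m, int k, int n, int rowsPerChunk)
{
    float *dA, *dB, *dC;
    cudaMalloc(&dB, (size_t)k * n * sizeof(float));
    cudaMalloc(&dA, (size_t)rowsPerChunk * k * sizeof(float));
    cudaMalloc(&dC, (size_t)rowsPerChunk * n * sizeof(float));
    cudaMemcpy(dB, hB, (size_t)k * n * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    for (int r0 = 0; r0 < m; r0 += rowsPerChunk) {
        int rows = (m - r0 < rowsPerChunk) ? (m - r0) : rowsPerChunk;
        // [host] extract a block of rows of A and copy it to the device
        cudaMemcpy(dA, hA + (size_t)r0 * k, (size_t)rows * k * sizeof(float),
                   cudaMemcpyHostToDevice);
        // [device] multiply the row block by B
        dim3 grid((n + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
        rowBlockGemm<<<grid, block>>>(dA, dB, dC, rows, k, n);
        // [host] copy the finished rows back into C at the right offset
        cudaMemcpy(hC + (size_t)r0 * n, dC, (size_t)rows * n * sizeof(float),
                   cudaMemcpyDeviceToHost);
    }
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
[/code]

if B does not fit either, the same idea extends to tiling B as well and accumulating partial results per tile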

using streams generally helps to hide the memory transactions
streams also allow events, and events are the easiest method of synchronization/handshaking I can think of for the host to keep track of completed rows, so that it does not flood the device and knows where to store the finished rows
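
a sketch of the same loop driven by two streams plus events, reusing the rowBlockGemm kernel from the sketch above; hA and hC would need to be pinned (cudaHostAlloc / cudaHostRegister) for the async copies to actually overlap with compute - again, just an illustration with error checking omitted:

[code]
void hugeGemmStreams(const float *hA, const float *hB, float *hC,
                     int m, int k, int n, int rowsPerChunk)
{
    const int NSTREAMS = 2;
    cudaStream_t stream[NSTREAMS];
    cudaEvent_t  done[NSTREAMS];
    float *dA[NSTREAMS], *dC[NSTREAMS], *dB;

    cudaMalloc(&dB, (size_t)k * n * sizeof(float));
    cudaMemcpy(dB, hB, (size_t)k * n * sizeof(float), cudaMemcpyHostToDevice);

    for (int s = 0; s < NSTREAMS; ++s) {
        cudaStreamCreate(&stream[s]);
        cudaEventCreate(&done[s]);
        cudaMalloc(&dA[s], (size_t)rowsPerChunk * k * sizeof(float));
        cudaMalloc(&dC[s], (size_t)rowsPerChunk * n * sizeof(float));
    }

    dim3 block(16, 16);
    int s = 0;
    for (int r0 = 0; r0 < m; r0 += rowsPerChunk, s = (s + 1) % NSTREAMS) {
        int rows = (m - r0 < rowsPerChunk) ? (m - r0) : rowsPerChunk;
        // wait for the previous chunk issued on this stream to drain,
        // so the host does not flood the device and can reuse the buffers
        cudaEventSynchronize(done[s]);
        cudaMemcpyAsync(dA[s], hA + (size_t)r0 * k,
                        (size_t)rows * k * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        dim3 grid((n + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
        rowBlockGemm<<<grid, block, 0, stream[s]>>>(dA[s], dB, dC[s], rows, k, n);
        cudaMemcpyAsync(hC + (size_t)r0 * n, dC[s],
                        (size_t)rows * n * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
        // the host tracks completed rows through this event
        cudaEventRecord(done[s], stream[s]);
    }
    for (int i = 0; i < NSTREAMS; ++i) {
        cudaStreamSynchronize(stream[i]);
        cudaStreamDestroy(stream[i]);
        cudaEventDestroy(done[i]);
        cudaFree(dA[i]); cudaFree(dC[i]);
    }
    cudaFree(dB);
}
[/code]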

in the above, the one matrix is assumed to be already transposed
if you follow the above, you would use the same principle to transpose the matrix to satisfy the assumption
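
to make that concrete, a naive illustrative kernel for the block-wise transpose: the host streams in a chunk of rows of B, the device writes out that chunk transposed, and the host scatters it back into the big B^T on the host (a strided copy, e.g. cudaMemcpy2D with destination pitch k*sizeof(float)); the names are hypothetical:

[code]
// transpose one host-supplied block of rows of B (rows x n) into an
// (n x rows) block that the host later copies into B^T at the right columns
__global__ void transposeBlock(const float *Bblock, float *BtBlock,
                               int rows, int n)
{
    int r = blockIdx.y * blockDim.y + threadIdx.y;   // row within this block of B
    int c = blockIdx.x * blockDim.x + threadIdx.x;   // column of B
    if (r < rows && c < n)
        BtBlock[(size_t)c * rows + r] = Bblock[(size_t)r * n + c];
}
[/code]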

The cublasXt API can handle pretty much everything for you. It should be able to do a matrix-matrix multiply on matrices that fit into system memory; it is not limited by device memory.

[url]http://docs.nvidia.com/cuda/cublas/index.html#unique_235311925[/url]

It works with 2 GPUs if they are on a dual-GPU board such as the Tesla K10 or K80. It can also work with a single GPU:

[url]http://docs.nvidia.com/cuda/cublas/index.html#cublasxt_deviceSelect[/url]
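
For completeness, a minimal cublasXt sketch (CUDA 6.0 and later); A, B and C are ordinary host pointers in column-major order and the library tiles them onto the selected GPU(s) itself - the sizes and device list are illustrative:

[code]
#include <cublasXt.h>

// C = A * B with host-resident, column-major matrices; link with -lcublas
int xtGemm(const float *A, const float *B, float *C,
           size_t m, size_t n, size_t k)
{
    cublasXtHandle_t handle;
    if (cublasXtCreate(&handle) != CUBLAS_STATUS_SUCCESS) return -1;

    int devices[1] = {0};                      // a single GPU works too
    cublasXtDeviceSelect(handle, 1, devices);

    const float alpha = 1.0f, beta = 0.0f;
    cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                  m, n, k, &alpha,
                  A, m,      // lda
                  B, k,      // ldb
                  &beta,
                  C, m);     // ldc
    cublasXtDestroy(handle);
    return 0;
}
[/code]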

Thanks, but I am using CUDA 5.5 due to a production environment restriction, and I understand cublasXt was only introduced in CUDA 6.0.