Question on matrix multiply performance when the matrix is very large

Hi all,

I am facing a situation where the matrix is bigger than the device memory (12 GB device; I am using a Tesla K20) when doing a matrix multiply.

Has anybody had a similar experience before?
How should I deal with this situation? To be more straightforward, I think the huge matrix needs to be split before it is copied from host to device. Is there any sample I can reference?

I am a newbie to CUDA, looking forward to any comments and feedback.

Allen Zhang

indeed possible
it involves storing the matrices on the host, and only using sub-matrices on the device
the host should lead the process - it controls and tracks the sub-matrices, and puts them in place
in general terms, the host keeps the grand matrix in mind; the device is generally oblivious to it
streams may help the process

a matrix multiplication can be done row by row: each output element is the dot product of a row of the first matrix and a row of the transposed second matrix
if you follow how a host-device combination can multiply matrices on a row basis, it should not be difficult to comprehend how this can be extended to sub-matrix tiles
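
To spell out the row view and its extension to tiles, here is the arithmetic in symbols. The notation is mine, not the poster's: A is m x k, B is k x n, and capital indices I, J, P number the tiles after each matrix is cut into blocks of compatible sizes.

```latex
% Element view: row i of A dotted with row j of B^T
C_{ij} = \sum_{p=1}^{k} A_{ip} B_{pj} = \sum_{p=1}^{k} A_{ip} \, (B^{T})_{jp}

% Tile view: the same formula with blocks in place of scalars
C_{IJ} = \sum_{P} A_{IP} \, B_{PJ}
```

Read this way, the device only ever needs one tile each of A, B and C resident at a time; the host decides which (I, J, P) combination comes next and accumulates the partial products into the right place in C.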

the very high level pseudo code should look like:

for each block of rows of the input matrix:
    extract the rows from the input matrix (matrices) and copy them to the device [host]
    perform the desired row operations on the rows [device]
    copy the rows back to their appropriate location in the output matrix on the host [host]
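
As a rough illustration of the loop above, here is a host-only C++ sketch. It is not CUDA code: `multiply_row_block` is a hypothetical CPU stand-in for whatever kernel or cuBLAS call would run on the device, and the two staging vectors stand in for device buffers that a real program would fill with host-to-device and device-to-host copies. All names are made up for the example, matrices are row-major, and the second matrix is assumed to be supplied already transposed (`bt`) and to fit on the device in full.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// [device] stand-in: multiplies `rows` rows of A (rows x k) by Bt (n x k,
// i.e. B already transposed) and writes a rows x n block of C.
void multiply_row_block(const float* a_rows, const float* bt, float* c_rows,
                        std::size_t rows, std::size_t n, std::size_t k) {
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p)
                acc += a_rows[i * k + p] * bt[j * k + p];
            c_rows[i * n + j] = acc;
        }
    }
}

// [host] driver: computes C (m x n) = A (m x k) * B, processing at most
// rows_per_block rows of A per iteration.
std::vector<float> multiply_in_row_blocks(const std::vector<float>& a,
                                          const std::vector<float>& bt,
                                          std::size_t m, std::size_t n,
                                          std::size_t k,
                                          std::size_t rows_per_block) {
    assert(a.size() == m * k && bt.size() == n * k && rows_per_block > 0);
    std::vector<float> c(m * n, 0.0f);
    // Stand-ins for device buffers; their size is what limits the block size.
    std::vector<float> staging_in(rows_per_block * k);
    std::vector<float> staging_out(rows_per_block * n);

    for (std::size_t r0 = 0; r0 < m; r0 += rows_per_block) {
        const std::size_t rows = std::min(rows_per_block, m - r0);
        // [host] extract a block of rows (a host-to-device copy in real code)
        std::copy(a.data() + r0 * k, a.data() + (r0 + rows) * k,
                  staging_in.data());
        // [device] perform the row operations on the block
        multiply_row_block(staging_in.data(), bt.data(), staging_out.data(),
                           rows, n, k);
        // [host] store the finished rows at their place in the output matrix
        std::copy(staging_out.data(), staging_out.data() + rows * n,
                  c.data() + r0 * n);
    }
    return c;
}
```

If the transposed matrix does not fit on the device either, the same loop can be nested over blocks of rows of `bt`, which is where the sub-matrix tiles come from.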

using streams generally helps to hide the memory transactions
streams also allow events, and events are the easiest method of synchronization/handshaking i can think of: the host needs to keep track of the rows completed, so that it does not flood the device and knows where to store completed rows

in the above, one of the matrices is assumed to be already transposed
if you follow the above, you would use the same principle to transpose the matrix to satisfy the assumption
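
One way the transposition could follow the same block-of-rows pattern is sketched below, again as host-only C++ rather than CUDA. `transpose_block` is a hypothetical CPU stand-in for a device transpose kernel, the staging vectors stand in for device buffers, and the names and row-major layout are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// [device] stand-in: transposes a rows x cols block into a cols x rows block.
void transpose_block(const float* in, float* out,
                     std::size_t rows, std::size_t cols) {
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            out[j * rows + i] = in[i * cols + j];
}

// [host] driver: builds Bt (n x k) from B (k x n), rows_per_block rows at a time.
std::vector<float> transpose_in_row_blocks(const std::vector<float>& b,
                                           std::size_t k, std::size_t n,
                                           std::size_t rows_per_block) {
    assert(b.size() == k * n && rows_per_block > 0);
    std::vector<float> bt(n * k);
    std::vector<float> staging_in(rows_per_block * n);
    std::vector<float> staging_out(rows_per_block * n);

    for (std::size_t r0 = 0; r0 < k; r0 += rows_per_block) {
        const std::size_t rows = std::min(rows_per_block, k - r0);
        // [host] extract a block of rows of B
        std::copy(b.data() + r0 * n, b.data() + (r0 + rows) * n,
                  staging_in.data());
        // [device] transpose the block
        transpose_block(staging_in.data(), staging_out.data(), rows, n);
        // [host] the block is now n x rows; row j belongs in row j of Bt,
        // starting at column r0
        for (std::size_t j = 0; j < n; ++j)
            std::copy(staging_out.data() + j * rows,
                      staging_out.data() + (j + 1) * rows,
                      bt.data() + j * k + r0);
    }
    return bt;
}
```

The only real difference from the multiply loop is the write-back: each transposed block lands as a strip of columns in the output rather than as a contiguous run of rows, so the host does one strided copy per output row.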

The cublasXt API can handle pretty much everything for you. It should be able to do a matrix-matrix multiply on matrices that fit into system memory; it should not be limited by device memory.


It works with 2 GPUs if they are on a dual-GPU board such as the Tesla K10 or K80. It can also work with a single GPU:


Thanks, but I am using CUDA 5.5 due to production environment restrictions. As I understand it, cublasXt was only introduced in CUDA 6.0.