Large memory Matrix

I am using http://docs.nvidia.com/cuda/cublas/#cublas-lt-t-gt-gemm to do matrix by matrix multiplication.

But my matrix is large and I keep getting out-of-memory errors. Is there an algorithm or a way to work around this?

I was thinking of splitting the work into smaller matrices and then adding the partial products together. Would that work, or is there a better way to do this?

I don’t know how large your matrices are, but arithmetically a matrix multiply can be decomposed into sub-problems. The beginning part of the answer here:

http://stackoverflow.com/questions/9250897/dynamic-matrix-multiplication-with-cuda/9261675#9261675

discusses how a matrix multiply can be decomposed into a set of “smaller” problems.
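To make the decomposition concrete, here is a CPU-only sketch (plain C++, not cuBLAS) of the idea: split the inner k dimension into tiles and accumulate the partial products into C. Scaled up, each tile multiply can be made small enough to fit in GPU memory on its own. The function names and the tile size are illustrative, not from any library.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Reference GEMM: C = A * B, row-major; A is m x k, B is k x n, C is m x n.
void gemm_ref(const std::vector<float>& A, const std::vector<float>& B,
              std::vector<float>& C, size_t m, size_t n, size_t k) {
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = acc;
        }
}

// Blocked GEMM: split the k dimension into tiles of size kb and
// accumulate the partial products C += A_tile * B_tile.  Each tile
// multiply touches only a slice of A and B, which is what makes it
// possible to process a problem piecewise when memory is tight.
void gemm_blocked(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, size_t m, size_t n, size_t k,
                  size_t kb) {
    std::fill(C.begin(), C.end(), 0.0f);
    for (size_t p0 = 0; p0 < k; p0 += kb) {
        const size_t pEnd = std::min(p0 + kb, k);
        for (size_t i = 0; i < m; ++i)
            for (size_t j = 0; j < n; ++j) {
                float acc = 0.0f;
                for (size_t p = p0; p < pEnd; ++p)
                    acc += A[i * k + p] * B[p * n + j];
                C[i * n + j] += acc;  // accumulate this tile's contribution
            }
    }
}
```

Both functions produce identical results; the blocked version just arrives at them one k-tile at a time.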

cublasXt can handle this for you, i.e., it allows you to perform a matrix multiply on a problem that fits in CPU memory but not in GPU memory:

http://docs.nvidia.com/cuda/cublas/index.html#unique_235311925
http://docs.nvidia.com/cuda/cublas/index.html#cublasxt_gemm

It will decompose the problem under the hood for you.
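A minimal sketch of the cublasXt path, assuming single precision, square n-by-n matrices, and a single GPU (device 0). Note that `cublasXtSgemm` takes host pointers; the library tiles the operands and streams the tiles to the device itself, so the full matrices only need to fit in host memory:

```cpp
#include <cublasXt.h>
#include <cstdio>
#include <vector>

int main() {
    const size_t n = 2324;  // illustrative size, ~21.6 MB per float matrix
    std::vector<float> A(n * n, 1.0f), B(n * n, 1.0f), C(n * n, 0.0f);

    cublasXtHandle_t handle;
    if (cublasXtCreate(&handle) != CUBLAS_STATUS_SUCCESS) {
        fprintf(stderr, "cublasXtCreate failed\n");
        return 1;
    }
    int devices[] = {0};  // run on GPU 0
    cublasXtDeviceSelect(handle, 1, devices);

    // C = alpha * A * B + beta * C, all operands in host memory.
    const float alpha = 1.0f, beta = 0.0f;
    cublasStatus_t status = cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                                          n, n, n, &alpha,
                                          A.data(), n, B.data(), n,
                                          &beta, C.data(), n);
    if (status != CUBLAS_STATUS_SUCCESS)
        fprintf(stderr, "cublasXtSgemm failed: %d\n", (int)status);

    cublasXtDestroy(handle);
    return 0;
}
```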

txbob,

My array size is around 21,600,000 bytes.

I will look into cublasxt_gemm.

What kind of GPU do you use? 21 MB is not particularly large as matrices go, and with even budget GPUs offering at least 1 GB of on-board memory, you should be able to keep several such matrices resident on the GPU. You may want to look more closely at the memory management performed by your application.

Agreed. Your out-of-memory problem should be investigated rather than looking for a different library.

If you have an out-of-memory issue that is actually due to this matrix size (perhaps because you have ~50 such matrices in memory), then the correct solution would be to manage that situation somehow. cublasXt won’t solve any problems like that for you.

I have a Tesla C2050. I believe it has 3GB of memory. I believe the problem is that the result matrix is too large to fit into memory.

There are a number of different Tesla GPUs with different amounts of memory, but a Tesla C2050 has 3 GB of on-board memory, which is enough for many instances of a 21 MB matrix. Since you have not shown any code that would allow others to reproduce your issue, I can only give the general recommendations to

  1. review the number of memory allocations, and the size of each
  2. properly check the return status of each CUDA API call, each CUBLAS API call, and each kernel launch
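Point 2 is usually done with a pair of checking macros. This is a common pattern rather than anything cuBLAS-specific; the allocation and kernel in the usage comment are placeholders:

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error %s at %s:%d\n",             \
                    cudaGetErrorString(err_), __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

#define CUBLAS_CHECK(call)                                          \
    do {                                                            \
        cublasStatus_t st_ = (call);                                \
        if (st_ != CUBLAS_STATUS_SUCCESS) {                         \
            fprintf(stderr, "cuBLAS error %d at %s:%d\n",           \
                    (int)st_, __FILE__, __LINE__);                  \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Usage (placeholders, not from the poster's code):
//   float *dA;
//   CUDA_CHECK(cudaMalloc(&dA, bytes));   // an out-of-memory error surfaces here
//   CUBLAS_CHECK(cublasSgemm(handle, /* ... */));
//   myKernel<<<grid, block>>>(/* ... */);
//   CUDA_CHECK(cudaGetLastError());       // check the kernel launch itself
```

With every call checked, the failing allocation (and its requested size) is pinpointed immediately instead of surfacing later as a mysterious GEMM failure.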

Would cublasXt actually take a very large matrix, split it up across streams, and compute the result for you? That was my impression.

Yes, but a 21MB matrix is not a very large matrix, and there would be no need or reason to split it up. Rather than pursuing this path, I would suggest getting a very crisp understanding of the out-of-memory issue. A 21MB matrix cannot by itself cause an out-of-memory error on any GPU.

21MB << 3GB
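The arithmetic behind that claim, assuming single-precision (4-byte) elements and square matrices, with the 21,600,000-byte size reported above:

```cpp
#include <cassert>
#include <cstddef>

// Footprint estimate for C = A * B with the reported matrix size.
constexpr size_t kMatrixBytes = 21600000;                // ~21.6 MB per matrix
constexpr size_t kElems = kMatrixBytes / sizeof(float);  // 5,400,000 floats
// A square matrix of 5,400,000 elements is roughly 2324 x 2324.
// The multiply keeps all three operands resident at once:
constexpr size_t kTotalBytes = 3 * kMatrixBytes;         // 64,800,000 bytes
constexpr size_t kGpuBytes = 3ULL * 1024 * 1024 * 1024;  // 3 GB on a C2050
// ~64.8 MB is about 2% of 3 GB, so three such matrices are nowhere near
// enough to exhaust device memory on their own.
```

Note that it would take on the order of 50 sets of three such matrices to actually fill a 3 GB card, which matches the earlier remark about ~50 matrices in memory.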

After doing some profiling, it seems I am using over 3GB of memory for the initial matrix. When I do a transpose, I use at least double that memory.

The host itself has 128GB of memory but the GPU has 3GB.