Huge Matrices: general question about how best to deal with very large matrices (>4GB)


This is rather general and I wasn’t sure which section to post in, so apologies if some parts seem like they would be better elsewhere.

Has anyone had experience working with matrices over 4GB in size with CUDA? I am particularly interested in solving linear equations via LU and Cholesky decomposition (but maybe some other algorithm scales better for very large matrices?).

So I have a pile of questions. Starting with the hardware these include:

  • Is the current 4GB limit related to 32-bit addresses?
  • If not, are there any plans to release cards with >4GB (my largest matrix is under 32GB)?

On the software side:

  • Is there any existing software that handles this kind of problem?
  • Is there any plausible solution that would be faster than using a single x86 processor when memory size is limited?
  • Related to that, what is the best scaling of memory read/writes I can expect? In other words, if I process an 8GB and then a 16GB matrix, will the number of reads/writes double, or scale to some higher power?
  • I assume the best choice of algorithm is dominated by the issue above - reducing reads/writes to the card memory. Any guidance on what algorithm that would imply?
  • Am I missing some obvious workaround?
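On the read/write scaling question, a rough answer comes from the classical Hong–Kung I/O bound: for matmul-like kernels (LU and Cholesky are in that family), with fast memory of size M the traffic to slow memory is on the order of n³/√M. A back-of-envelope sketch, assuming 8-byte doubles and dropping all constants:

```python
import math

def transfer_estimate(matrix_bytes, device_bytes, elem_size=8):
    """Rough O(n^3 / sqrt(M)) transfer estimate for blocked dense
    factorizations (Hong-Kung style bound, constants dropped)."""
    n = math.isqrt(matrix_bytes // elem_size)   # matrix is n x n
    m = device_bytes // elem_size               # elements that fit on the card
    return n ** 3 / math.sqrt(m)

# Doubling the matrix *bytes* (8GB -> 16GB) scales n by sqrt(2),
# so the transfer estimate grows by 2*sqrt(2) ~ 2.8x.
GB = 1 << 30
ratio = transfer_estimate(16 * GB, 4 * GB) / transfer_estimate(8 * GB, 4 * GB)
print(round(ratio, 2))  # 2.83
```

So, under that bound, going from an 8GB matrix to a 16GB one should cost roughly 2.8x the transfers: somewhat worse than doubling, but far from quadratic.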


I am probably not fit to answer this question, but couldn’t you decompose the matrix multiplication into something that can be placed on the GPU? I would imagine you could copy the core idea of Strassen’s algorithm.
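To illustrate the decomposition idea, here is a CPU-side sketch of plain tiling (not Strassen itself), with NumPy standing in for the GPU kernel; the tile size is a made-up parameter that would be chosen so two tiles fit in card memory:

```python
import numpy as np

def blocked_matmul(a, b, tile=256):
    """Sketch of a tiled matrix multiply: each (tile x tile) product is
    small enough to ship to a GPU; here NumPy stands in for the kernel."""
    n = a.shape[0]
    c = np.zeros_like(a)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                # on a real card: copy the two tiles in, multiply, accumulate
                c[i:i + tile, j:j + tile] += (
                    a[i:i + tile, k:k + tile] @ b[k:k + tile, j:j + tile]
                )
    return c

rng = np.random.default_rng(0)
a = rng.standard_normal((512, 512))
b = rng.standard_normal((512, 512))
assert np.allclose(blocked_matmul(a, b), a @ b)
```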

Thanks, that’s the kind of thing I was looking for when I asked about suitable algorithms. However, the main problem is not multiplication, but decomposition.

And also, of course, the rest of my questions still stand - having a board with 32GB would be much faster than any solution that involves reading and writing chunks back and forth, even if the problem can be divided into a reasonable number of chunks.


  • The 4GB is (like you said) because CUDA uses 32-bit device pointers. I haven’t heard that this was going to be changed any time soon (though I don’t see why nVidia doesn’t just go ahead and put it in, since video cards will have >4GB of memory sooner or later).

  • The GT300 series due out later this year or next year is supposedly using GDDR5, which is a bit cheaper than GDDR3. However, putting >4GB on a card (say 8GB or 16GB) is still going to incur a huge cost for the foreseeable future.

  • The best thing you can probably do on the software side is to figure out (or find) some sort of blocked algorithm for whatever problem you are working on. Then you can split the problem between however many video cards you need (or multiple kernel runs on a single card), and your code will easily scale up to new GPUs with more memory.
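For reference, here is a sketch of what such a blocked algorithm looks like for Cholesky (right-looking variant), in NumPy rather than CUDA just to show the access pattern; each step only touches a small panel plus the trailing submatrix, so every per-block operation could be sized to fit on the card:

```python
import numpy as np

def blocked_cholesky(a, nb=64):
    """Right-looking blocked Cholesky sketch: A = L L^T, L lower triangular.
    Each iteration factors one diagonal block, solves the panel below it,
    and updates the trailing submatrix - each piece is a small dense op."""
    a = a.copy()
    n = a.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # factor the small diagonal block
        a[k:e, k:e] = np.linalg.cholesky(a[k:e, k:e])
        # triangular solve for the panel below it: L21 = A21 * L11^-T
        a[e:, k:e] = np.linalg.solve(a[k:e, k:e], a[e:, k:e].T).T
        # symmetric rank-nb update of the trailing matrix
        a[e:, e:] -= a[e:, k:e] @ a[e:, k:e].T
    return np.tril(a)

rng = np.random.default_rng(1)
m = rng.standard_normal((200, 200))
spd = m @ m.T + 200 * np.eye(200)   # make a symmetric positive definite matrix
L = blocked_cholesky(spd, nb=64)
assert np.allclose(L @ L.T, spd)
```

Out-of-core LU works the same way, with pivoting making the bookkeeping messier; the point is only that the working set per step is a panel, not the whole matrix.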

I have done some work on LU; the CUDA-enabled Linpack can solve any problem size, and it will do multiple passes if the matrix does not fit on the GPUs.

You may also want to check out the zero-copy feature of CUDA 2.2, which lets a kernel read pinned host memory directly.
But I am fairly sure it cannot access sizes beyond the 4GB limit, and I am not sure whether a zero-copy pointer eats into the 32-bit device address space.
It is also dead slow; I am not sure the latencies can be hidden.

Just my few cents.

Thanks to everyone who replied (and sorry for not checking back sooner). There was some very useful info here. I’ve sent a message (via this message board thing) to Massimiliano (mfatica) asking for details of how to access the Linpack code, but I may as well also ask here in case anyone else knows…?

Thanks again - really useful.


I am doing an LDL^T decomposition using GPUs… If you look at the problem, not all sections of the matrix are required all the time…

Maybe you can use such tricks to bring down the footprint.

Maybe you’ve thought about it yourself, but a sparse matrix notation might reduce the memory footprint as well.
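To show the footprint argument, here is a toy CSR (compressed sparse row) encoding in plain NumPy; a real code would use a sparse library, but the idea is just to store the nonzeros plus two index arrays instead of the full n×n grid:

```python
import numpy as np

def to_csr(dense):
    """Minimal CSR sketch: values + column indices + row pointers,
    instead of storing every entry of the dense matrix."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return np.array(values), np.array(col_idx), np.array(row_ptr)

# a 4x4 matrix with 6 nonzeros stores 6 values, 6 column indices,
# and 5 row pointers instead of 16 dense entries
a = np.array([[4., 0., 0., 1.],
              [0., 3., 0., 0.],
              [0., 0., 2., 0.],
              [1., 0., 0., 5.]])
values, col_idx, row_ptr = to_csr(a)
print(len(values))  # 6
```

Whether this helps depends entirely on the matrix, of course - a dense system gains nothing, and the factors of a sparse matrix can fill in badly unless the ordering is chosen carefully.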