This is rather general and I wasn’t sure which section to post in, so apologies if some parts seem like they would be better elsewhere.
Has anyone had experience working with matrices over 4GB in size with CUDA? I am particularly interested in solving linear systems via LU and Cholesky decomposition (though perhaps some other algorithm scales better for very large matrices?).
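For concreteness, the operation I mean is the standard factor-then-solve pattern; here is a tiny NumPy sketch (the sizes are stand-ins, my real matrices are tens of GB):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)      # symmetric positive definite, so Cholesky applies
b = rng.standard_normal(n)

# Cholesky factor A = L L^T, then two triangular solves
L = np.linalg.cholesky(A)
y = np.linalg.solve(L, b)        # forward substitution:  L y = b
x = np.linalg.solve(L.T, y)      # back substitution:     L^T x = y

residual = np.linalg.norm(A @ x - b)
```

(For a non-symmetric matrix the same pattern applies with an LU factorization in place of the Cholesky factor.)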
So I have a pile of questions, starting with the hardware:
- Is the current 4GB limit a consequence of 32-bit addressing?
- If not, are there any plans to release cards with >4GB (my largest matrix is under 32GB)?
On the software side:
- Is there any existing software that handles this kind of problem?
- Is there any plausible solution that would be faster than using a single x86 processor when the matrix is too large to fit in the card's memory?
- Related to that, what is the best scaling of memory reads/writes I can expect? In other words, if I process an 8GB and then a 16GB matrix, will the number of reads/writes double, or grow by some higher power?
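To make that question concrete, my rough arithmetic (assuming dense double-precision storage, and assuming a blocked out-of-core factorization whose transfer volume grows like the flop count when card memory is held fixed):

```python
import math

BYTES = 8  # double precision

def matrix_dim(gib):
    """Largest n such that an n x n double matrix fits in `gib` GiB."""
    return math.isqrt(gib * 2**30 // BYTES)

n8  = matrix_dim(8)    # dimension of an 8 GiB matrix
n16 = matrix_dim(16)   # dimension of a 16 GiB matrix

# LU/Cholesky flops grow as n^3, so doubling the matrix *size in bytes*
# multiplies n by sqrt(2) and the work by 2^(3/2) ~ 2.83
flop_ratio = (n16 / n8) ** 3
```

So if transfers track flops, doubling the matrix should cost roughly 2.8x the reads/writes, not 2x; but I'd like confirmation that this is the right model for the card.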
- I assume the best choice of algorithm is dominated by the issue above, i.e. minimizing reads/writes to the card's memory. Any guidance on which algorithm that would imply?
- Am I missing some obvious workaround?
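For reference, the kind of blocked factorization I have in mind looks like this (pure NumPy to show the structure, not CUDA code; the tile size b is a stand-in for whatever panel fits in card memory):

```python
import numpy as np

def blocked_cholesky(A, b=64):
    """Right-looking blocked Cholesky: returns lower-triangular L with A = L @ L.T.

    Each outer step factors one b x b diagonal tile, solves the panel
    below it, then applies one rank-b update to the trailing submatrix,
    so tiles are streamed through fast memory a bounded number of times.
    """
    n = A.shape[0]
    A = A.copy()
    for k in range(0, n, b):
        kb = min(b, n - k)
        # Factor the diagonal tile in place
        A[k:k+kb, k:k+kb] = np.linalg.cholesky(A[k:k+kb, k:k+kb])
        Lkk = A[k:k+kb, k:k+kb]
        if k + kb < n:
            # Triangular solve for the panel below the diagonal tile:
            # L21 = A21 @ Lkk^{-T}
            A[k+kb:, k:k+kb] = np.linalg.solve(Lkk, A[k+kb:, k:k+kb].T).T
            # Symmetric rank-b update of the trailing submatrix
            A[k+kb:, k+kb:] -= A[k+kb:, k:k+kb] @ A[k+kb:, k:k+kb].T
    return np.tril(A)

# Small check on a random SPD matrix
rng = np.random.default_rng(1)
n = 96
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
L = blocked_cholesky(A, b=32)
```

An out-of-core version would presumably keep only the diagonal tile, one panel, and one trailing tile resident at a time; is that the right mental model for CUDA, or does the transfer granularity change the picture?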