How to perform eigenvalue decompositions for large matrices with Python

I have dense, symmetric matrices of size roughly 5e4x5e4 to 1e5x1e5 whose eigenvalues I want to compute. I have looked at CuPy (cupy.linalg.eigvalsh), but I run out of memory on a single GPU when using it. I know cuSOLVER has a multi-GPU extension; are there any Python libraries that wrap it? Or is there some other way to go about this?
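
For concreteness, this is roughly what I am running (a minimal sketch with a random stand-in matrix; the real matrices are built elsewhere):

```python
import cupy as cp

n = 52_500                          # 52500 x 52500 float64 is ~22 GB

# Stand-in for the real matrix (the actual data is dense and symmetric).
# eigvalsh only reads one triangle (UPLO='L' by default), so for a memory
# test the stand-in does not need to be explicitly symmetrized.
a = cp.random.rand(n, n, dtype=cp.float64)

# Eigenvalues only -- this is the call that runs out of memory on a single GPU
w = cp.linalg.eigvalsh(a)
```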


Rent the hardware from one of the big cloud providers, maybe? For example, AWS appears to offer instances with A100 and H100 acceleration. I don’t know what the pricing looks like. Since these instances are probably popular with the AI crowd, I would imagine fairly stiff pricing.

Running out of memory is always a problem for linear algebra. Are the matrices sparse or banded?
You probably want to compute in double precision?
You have to find an algorithm that suits your matrix type and numerical stability requirements.

You would have to find a way to distribute the constituent operations across multiple GPUs.

Can you accept the longer execution time due to the additional memory transfers?

Workstation-class graphics cards support NVLink, which can be faster than PCIe.

I have access to some A100/H100 GPUs through my university. I initially tried an A100 (40GB) GPU. The array (52500x52500) is ~22GB, and CuPy wants to allocate twice that (presumably to store the eigenvectors internally), which was too much for that GPU. I then tried an A100 (80GB) GPU, which was able to allocate the memory but then ran into a CUSOLVER_STATUS_INVALID_VALUE error. I think this is because the total number of array elements exceeds the maximum 32-bit int value. Does CuPy use the 64-bit cuSOLVER API?
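
For reference, the back-of-the-envelope numbers (plain arithmetic):

```python
n = 52_500
bytes_per_double = 8

matrix_bytes = n * n * bytes_per_double
print(matrix_bytes / 1e9)           # ~22.1 GB for the input matrix
print(2 * matrix_bytes / 1e9)       # ~44.1 GB if the eigenvectors are stored as well

# Element count vs. the 32-bit int limit used by the legacy cuSOLVER API
print(n * n)                        # 2_756_250_000
print(2**31 - 1)                    # 2_147_483_647 -> n*n overflows a 32-bit int
```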

Can you confirm that smaller matrix sizes work?

Perhaps you could try running one of the C++/CUDA symmetric-matrix examples, modified for your matrix size and with batch size 1, to see whether it is a cuSOLVER or a CuPy limitation.

You would also see better how much memory really has to be allocated.
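
If you'd rather stay in Python for a first check, a rough sketch using CuPy's memory pool counters (not the C++ sample, so it won't separate cuSOLVER limits from CuPy limits, but it shows at which size things fail and how much memory is held at that point) could look like this:

```python
import cupy as cp

pool = cp.get_default_memory_pool()

# Step the size up towards the failing case and watch how much memory the
# pool ends up holding, to see where eigvalsh starts to fail and with what.
for n in (5_000, 10_000, 20_000, 40_000, 52_500):
    pool.free_all_blocks()
    a = cp.random.rand(n, n, dtype=cp.float64)   # only one triangle is read
    try:
        cp.linalg.eigvalsh(a)                    # eigenvalues only
        status = "ok"
    except Exception as exc:                     # OutOfMemoryError, CUSOLVERError, ...
        status = type(exc).__name__
    print(f"n={n}: {status}, pool holds {pool.total_bytes() / 2**30:.1f} GiB")
    del a
```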

Yes, I can confirm that smaller matrix sizes work. I also came across cusolver error: CUSOLVER_STATUS_INVALID_VALUE for large matrices when trying `linalg.eigh` · Issue #7261 · cupy/cupy · GitHub, which seems to be the same problem I am hitting.

I will also try out the pytorch solver to see if it supports large array sizes.
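
A sketch of what I'll try (again with a random stand-in; torch.linalg.eigvalsh also reads only one triangle by default):

```python
import torch

n = 52_500                          # same size as before, ~22 GB in float64

# Stand-in for the real matrix, in double precision on the GPU
a = torch.rand(n, n, dtype=torch.float64, device="cuda")

# Eigenvalues only; on CUDA tensors this should dispatch to cuSOLVER
w = torch.linalg.eigvalsh(a)
```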

cuSOLVER has a legacy 32-bit API and a newer 64-bit API (since CUDA 11.0?).

In the 32-bit API, the helper functions return the required buffer size, as a number of elements of whatever type you use (e.g. doubles), through an int* lWork output parameter; the actual computation functions then take the buffer size as an int input parameter.

The 64-bit API uses size_t parameters.

You can either access the 64-bit API directly (from C++) or use a Python library that calls the newer 64-bit API (if there is one).

I believe pytorch uses the 64-bit API. I tried this, and I don’t get a cuSOLVER error, but I get an out of memory error. It says it tried to allocate ~66GB after already allocating ~44GB. I don’t know why it needs this much space since the array is only 22GB.
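
The sizes do at least look consistent with everything scaling as n^2; this is just arithmetic on the reported numbers plus a guess that the extra request is workspace, not anything I have confirmed in the PyTorch source:

```python
n = 52_500
matrix_gb = n * n * 8 / 1e9         # ~22.1 GB for one n x n float64 matrix

print(44 / matrix_gb)               # ~2.0 -> the ~44 GB already allocated looks like
                                    #         two matrices (input + a copy/output?)
print(66 / matrix_gb)               # ~3.0 -> the ~66 GB request looks like three more
                                    #         matrices, i.e. an O(n^2) workspace?
```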

I think I’ll have to resort to using C++ and directly interfacing with the cuSOLVER API for this at the end of the day.

There was also a bug at one point (in CUDA 10.1) where a quadratic buffer size requirement was returned, but that should have been fixed by now (in 11.0).

sgesvd_bufferSize int32 overflow with CUDA 10.1 · Issue #2351 · cupy/cupy · GitHub (that issue describes both problems)

See pytorch/aten/src/ATen/native/cuda/linalg/CUDASolver.cpp at main · pytorch/pytorch · GitHub

Or Add cusolver potrf and potrfBatched to the backend of torch.cholesky decomposition by xwang233 · Pull Request #53104 · pytorch/pytorch · GitHub, where the 64-bit API was mentioned and introduced.