Please help with a cuSolver failure (CUSOLVER_STATUS_ALLOC_FAILED).


I’ve tried this sample
C:\ProgramData\NVIDIA Corporation\CUDA Samples\v10.1\7_CUDALibraries\cuSolverSp_LinearSolver

and this
C:\ProgramData\NVIDIA Corporation\CUDA Samples\v10.1\7_CUDALibraries\cuSolverSp_LowlevelQR

with this matrix
Num rows/cols: 377,002
Nonzeros: 27,582,698

and both failed with the CUSOLVER_STATUS_ALLOC_FAILED error status.

I have an i7-7700 with 32GB of system memory and a GTX 1060 6GB.
Before failing, the process allocated almost all of the 32GB of RAM, but nothing happened on the GPU (which had 5GB of VRAM free at that moment). I had commented the CPU steps out of those samples before running them.

So my question is: is it possible to solve this matrix on such a PC at all?
How can I estimate the resources the solver will need?


I didn’t have that problem when running the test on a system with Tesla V100-32GB and 128GB of system memory:

$ /usr/local/cuda/samples/7_CUDALibraries/cuSolverSp_LinearSolver/cuSolverSp_LinearSolver --file=ML_Laplace.mtx
GPU Device 0: "Tesla V100-PCIE-32GB" with compute capability 7.0

Using input file [ML_Laplace.mtx]
step 1: read matrix market format
sparse matrix A is 377002 x 377002 with 27689972 nonzeros, base=1
step 2: reorder the matrix A to minimize zero fill-in
        if the user choose a reordering by -P=symrcm, -P=symamd or -P=metis
step 2.1: no reordering is chosen, Q = 0:n-1
step 2.2: B = A(Q,Q)
step 3: b(j) = 1 + j/n
step 4: prepare data on device
step 5: solve A*x = b on CPU
WARNING: the matrix is singular at row 1 under tol (1.000000E-12)
step 6: evaluate residual r = b - A*x (result on CPU)
(CPU) |b - A*x| = NAN
(CPU) |A| = 5.149075E+07
(CPU) |x| = NAN
(CPU) |b| = 1.999997E+00
(CPU) |b - A*x|/(|A|*|x| + |b|) = NAN
step 7: solve A*x = b on GPU
WARNING: the matrix is singular at row 2 under tol (1.000000E-12)
step 8: evaluate residual r = b - A*x (result on GPU)
(GPU) |b - A*x| = NAN
(GPU) |A| = 5.149075E+07
(GPU) |x| = NAN
(GPU) |b| = 1.999997E+00
(GPU) |b - A*x|/(|A|*|x| + |b|) = NAN
timing chol: CPU = 1777.050997 sec , GPU =  82.648804 sec
show last 10 elements of solution vector (GPU)
consistent result for different reordering and solver
x[376992] = -NAN
x[376993] = -NAN
x[376994] = -NAN
x[376995] = -NAN
x[376996] = -NAN
x[376997] = -NAN
x[376998] = -NAN
x[376999] = -NAN
x[377000] = -NAN
x[377001] = -NAN

I assume the NAN values may be due to the singularity warnings, but haven’t investigated it.

I’m not sure where the breaking points would be in terms of memory size, and I don’t have any suggestions for predicting memory size from matrix size, except that larger matrices will require more memory. The error is a good indication that you don’t have enough memory.
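
For a rough lower bound, the size of the input in CSR form can be worked out from the row and nonzero counts alone. The sketch below assumes 8-byte double values and 4-byte int indices (the layout the `cusolverSpDcsr*` routines take); `csr_bytes` is a hypothetical helper, not part of cuSolver. It says nothing about the fill-in created during factorization, which cannot be derived from these two counts.

```python
def csr_bytes(n_rows, nnz, value_bytes=8, index_bytes=4):
    """Bytes for one CSR copy: values + column indices + row pointers.

    Assumes double-precision values and 32-bit indices by default.
    """
    values = nnz * value_bytes
    col_indices = nnz * index_bytes
    row_pointers = (n_rows + 1) * index_bytes
    return values + col_indices + row_pointers


# Sizes taken from the sample output above.
n, nnz = 377002, 27689972
total = csr_bytes(n, nnz)
print(f"CSR input: {total} bytes ({total / 1e9:.2f} GB)")
```

That comes to about a third of a gigabyte per copy, so the input alone fits comfortably on either machine; whatever exhausts memory has to be allocated on top of it by the solver.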

Thank you for looking into this!
I spent a day or two trying to find out whether I was doing something wrong with cuSolver.

Still, I have a feeling that cuSolver could be optimized to be more memory efficient.

Do you have access to the cuSolver source code? Maybe it tries to convert the sparse matrix into a dense one internally? Also, it would be interesting to know how much CPU/GPU memory it allocated during your run.

I don’t have access to the source code. I can assure you it does not attempt to convert the entire sparse matrix to dense internally. I witnessed about 10GB max of host memory utilization and 23GB max of GPU memory utilization during the run. Converting the sparse matrix in question (377002x377002) to a dense matrix would require over a terabyte of memory.
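
The dense figure is simple arithmetic. A minimal sketch, assuming double-precision (8-byte) entries; `dense_bytes` is a hypothetical helper used only for illustration:

```python
def dense_bytes(n, value_bytes=8):
    """Bytes needed to store a dense n x n matrix."""
    return n * n * value_bytes


n = 377002
total = dense_bytes(n)
print(f"dense {n} x {n}: {total} bytes ({total / 1e12:.3f} TB)")
```

That is roughly 1.14 TB, far more than the 10GB of host memory and 23GB of GPU memory observed during the run.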

OK, thank you for the information!