Out of memory error in cuSparse ILU0

Hi,

I am using the cuSparse library to compute an ILU0 preconditioner for a PCG solver.

When I run a small problem, everything works on multiple GPUs.

However, when I run a very large problem, the code works on 1 GPU (it fits in memory), but when I try to run on 2 or more GPUs I get “out of memory” errors (one per GPU).

The routine I am calling is:

cusparseStatus = cusparseDcsrilu02(cusparseHandle, N, M, M_described, CSR_LU, CSR_I, CSR_J, M_analyzed, M_policy, Mbuffer);

if (cusparseStatus!=0){ printf(" ERROR! ILU0 Formation Error = %s \n", cusparseGetErrorString(cusparseStatus)); exit(1); }
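For reference, here is a minimal sketch of the full csrilu02 sequence as the cuSPARSE docs describe it (buffer-size query, analysis, then factorization); the variable names are illustrative, not the poster's actual code, and n/nnz stand for the row count and nonzero count:

```c
#include <cuda_runtime.h>
#include <cusparse.h>
#include <stdio.h>

// Sketch of the legacy cuSPARSE csrilu02 sequence (double precision).
// handle, descrA, and the device arrays are assumed set up by the caller;
// the names are illustrative, not the poster's variables.
static cusparseStatus_t ilu0_factorize(cusparseHandle_t handle,
                                       cusparseMatDescr_t descrA,
                                       int n, int nnz,
                                       double *d_val,
                                       int *d_rowPtr,
                                       int *d_colInd) {
    csrilu02Info_t info;
    cusparseCreateCsrilu02Info(&info);

    // 1. Query the scratch-buffer size (per rank, on the selected device).
    int bufferSize = 0;
    cusparseDcsrilu02_bufferSize(handle, n, nnz, descrA,
                                 d_val, d_rowPtr, d_colInd,
                                 info, &bufferSize);

    void *pBuffer = NULL;
    if (cudaMalloc(&pBuffer, bufferSize) != cudaSuccess) {
        printf("cudaMalloc of %d-byte ILU0 buffer failed\n", bufferSize);
        cusparseDestroyCsrilu02Info(info);
        return CUSPARSE_STATUS_ALLOC_FAILED;
    }

    // 2. Analysis phase (must precede the factorization itself).
    cusparseStatus_t st = cusparseDcsrilu02_analysis(
        handle, n, nnz, descrA, d_val, d_rowPtr, d_colInd,
        info, CUSPARSE_SOLVE_POLICY_USE_LEVEL, pBuffer);

    // 3. Numeric factorization: overwrites d_val with the incomplete LU.
    if (st == CUSPARSE_STATUS_SUCCESS)
        st = cusparseDcsrilu02(handle, n, nnz, descrA,
                               d_val, d_rowPtr, d_colInd,
                               info, CUSPARSE_SOLVE_POLICY_USE_LEVEL,
                               pBuffer);

    cudaFree(pBuffer);
    cusparseDestroyCsrilu02Info(info);
    return st;
}
```

Checking the cudaMalloc of the buffer explicitly is worthwhile here, since an allocation failure at that step can otherwise surface later as an “out of memory” status from the cuSPARSE call.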

And the error I am getting is:

ERROR! ILU0 Analysis Error = out of memory

To run on multiple GPUs, I am using MPI with 1 rank per GPU. I set the CUDA device through OpenACC from the Fortran code (which calls the C code containing the CUDA call) with:

!$acc set device_num(irank)

According to another post on these forums, this should call cudaSetDevice() underneath.
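In case it is useful for comparison, a common pattern (a sketch, not the poster's code) is to bind each rank to a device from its node-local rank, so the mapping stays correct across multiple nodes:

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

// Sketch: bind each MPI rank to one GPU before any CUDA/cuSPARSE calls.
// Uses the node-local rank so the modulo mapping is valid on every node.
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Ranks sharing a node form one communicator; its rank is node-local.
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                        MPI_INFO_NULL, &node_comm);
    int local_rank;
    MPI_Comm_rank(node_comm, &local_rank);

    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    cudaSetDevice(local_rank % ndev);

    printf("rank %d -> device %d of %d\n", rank, local_rank % ndev, ndev);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```

If device_num(irank) is being fed a global rank on a multi-node run, each rank past the per-node device count would select a nonexistent device, which is one way per-rank CUDA calls can start failing.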

When using more than 1 GPU, N and M are smaller than when using 1 GPU (smaller by a factor of the number of GPUs), so I do not understand why the code can run on 1 GPU but runs out of memory on multiple GPUs.

When we run the same code on multiple A100 GPUs (which have more memory per GPU, 40GB, than the 32GB V100s used here), the problem runs fine as well.

The total memory needed when run on 1 GPU is only about 20GB.

I have also checked running nvidia-smi during the run and confirmed that the two MPI ranks are running on 2 GPUs (no oversubscribing going on).

Thanks!

– Ron

Some more information:

It seems to produce NP-1 “out of memory” errors when run on NP GPUs (1 error for 2 GPUs, 3 errors for 4 GPUs).

Not sure if this helps, but thought I would mention it.

– Ron

Oh, I forgot to mention that I am using NV SDK 22.1, run within a Singularity container, using the HPC-X OpenMPI 4 from inside the container.
I am running this on the Expanse system at SDSC using SLURM.

Also, the runs were just tested on a DGX A100-40GB system (using the same container) and they all worked fine using 1 to 8 GPUs.

– Ron

The routine should run without any problems on multiple GPUs. A few things to check:

  • Call cudaDeviceSynchronize() and cudaGetLastError() to ensure that no earlier asynchronous error is pending
  • Call cudaMemGetInfo() to ensure that there is enough free memory on the device each rank selected
  • Run compute-sanitizer to check for potential API/kernel errors
  • Check the buffer size returned by cusparseDcsrilu02_bufferSize() (the double-precision variant matching your call)
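The first two checks above could be sketched roughly like this (an illustrative helper, not the poster's code), run on each rank just before the ILU0 call:

```c
#include <cuda_runtime.h>
#include <stdio.h>

// Sketch of the suggested pre-flight checks before calling cusparseDcsrilu02.
static void preflight_check(size_t required_bytes) {
    // 1. Flush pending work and surface any earlier asynchronous error.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("pending error: %s\n", cudaGetErrorString(err));
    err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("sticky error: %s\n", cudaGetErrorString(err));

    // 2. Report free/total memory on the device this rank selected.
    size_t free_b = 0, total_b = 0;
    cudaMemGetInfo(&free_b, &total_b);
    printf("device memory: %zu MB free of %zu MB total\n",
           free_b >> 20, total_b >> 20);
    if (free_b < required_bytes)
        printf("WARNING: not enough free memory for a %zu-byte buffer\n",
               required_bytes);
}
```

Printing this per rank would also confirm whether every rank really ended up on a distinct device with the free memory you expect.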

Also, can you please report the size of the problem?