Hi, I am trying to use multiple GPUs with MPI to solve a large sparse linear system, similar to the simple_mgmn_mode
sample in the library samples. All of the matrices are stored on the first GPU.
Even though the code runs fine with only one GPU, I am getting this error when I try to use 2 GPUs:
[2025-07-19 23:25:27][CUDSS][2175735][Api][cudssCreate] start
[2025-07-19 23:25:29][CUDSS][2175735][Api][cudssSetCommLayer] start
[2025-07-19 23:25:29][CUDSS][2175735][Api][cudssConfigCreate] start
[2025-07-19 23:25:29][CUDSS][2175734][Api][cudssMatrixCreateCsr] start
[2025-07-19 23:25:45][CUDSS][2175734][Api][cudssDataCreate] start
[2025-07-19 23:25:45][CUDSS][2175734][Api][cudssDataSet] start
[2025-07-19 23:25:45][CUDSS][2175735][Api][cudssMatrixCreateDn] start
[2025-07-19 23:25:45][CUDSS][2175735][Api][cudssMatrixCreateDn] start
[2025-07-19 23:25:45][CUDSS][2175735][Api][cudssExecute] start
[2025-07-19 23:25:45][CUDSS][2175735][Info][cudssExecute] Run with phase 3
[2025-07-19 23:25:55][CUDSS][2175735][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 0: 0 (must be -1 from index 0)
[2025-07-19 23:25:55][CUDSS][2175735][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 1: 0 (must be -1 from index 1024)
[2025-07-19 23:25:55][CUDSS][2175735][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 2: 0 (must be -1 from index 1536)
[2025-07-19 23:25:55][CUDSS][2175735][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 3: 0 (must be -1 from index 1792)
[2025-07-19 23:25:55][CUDSS][2175735][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 4: 0 (must be -1 from index 1920)
[2025-07-19 23:25:55][CUDSS][2175735][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 5: 0 (must be -1 from index 1984)
[2025-07-19 23:25:55][CUDSS][2175735][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 6: 0 (must be -1 from index 2016)
[2025-07-19 23:25:55][CUDSS][2175735][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 7: 0 (must be -1 from index 2032)
[2025-07-19 23:25:55][CUDSS][2175735][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 8: 0 (must be -1 from index 2040)
[2025-07-19 23:25:55][CUDSS][2175735][Error][cudssExecute] Distributed node bounds has a wrong right endpoint for lvl = 9: -2 (must be -1 from index 2045)
[2025-07-19 23:25:55][CUDSS][2175735][Error][cudssExecute] Elimination tree is not consistent
The error happens at the cudssExecute call:
cudssErrchk(cudssExecute(cudssH,
                         CUDSS_PHASE_ANALYSIS,
                         cudss_solverConfig,
                         cudss_solverData,
                         cudss_Ac,
                         cudss_Qc,
                         cudss_cwork));
Hi @glassbook!
From the log it seems that the elimination tree is wrong. This might happen for two reasons:
- Invalid initial matrix data.
Please check that your matrix arrays have correct indices which match the indexing base, and that the matrix type and matrix view are what you expect (a minimal sketch of such a check is below the list).
Also check that the limitations for cudssExecute() and MGMN mode are satisfied, e.g. all processes must have correct matrix shapes.
- A bug in how cuDSS distributed the matrix among processes for your matrix (which somehow triggered an unaccounted edge case).
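For the first point, a minimal sketch of the kind of host-side check I mean, assuming a 0-based CSR matrix (the array and function names here are placeholders, not taken from your code):
/* Hypothetical CSR sanity check: consistent row offsets and in-range column indices. */
int check_csr(int n, int nnz, const int *csr_offsets, const int *csr_columns)
{
    if (csr_offsets[0] != 0 || csr_offsets[n] != nnz)
        return 0; /* offsets must start at the index base and end at nnz */
    for (int i = 0; i < n; i++) {
        if (csr_offsets[i] > csr_offsets[i + 1])
            return 0; /* row offsets must be non-decreasing */
        for (int j = csr_offsets[i]; j < csr_offsets[i + 1]; j++) {
            if (csr_columns[j] < 0 || csr_columns[j] >= n)
                return 0; /* column index out of range for 0-based indexing */
        }
    }
    return 1;
}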
If checking the matrix data does not help, could you share your matrix and a reproducer (or at least the code showing how you call cuDSS)?
Thanks,
Kirill
Thanks!
I have checked the input matrices as you suggested, but the problem persisted. I then also re-compiled and ran simple_mgmn_mode/simple_mgmn_mode.cpp
from CUDALibrarySamples, and it gave a similar error.
I also got this warning about MPI, and since there were recently some changes to our cluster's MPI installation, I think the error might be caused by that.
WARNING: A user-supplied value attempted to override the default-only MCA
variable named "mpi_built_with_cuda_support".
The user-supplied value was ignored.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
You requested to run with CUDA GPU Direct RDMA support but this OFED
installation does not have that support. Contact Mellanox to figure
out how to get an OFED stack with that support.
I have forwarded the MPI warning to our sysadmin and am currently trying to install a CUDA-aware MPI locally (without root) to check myself whether the problem really comes from MPI.
Error output from simple_mgmn_mode/simple_mgmn_mode.cpp:
[2025-07-21 15:30:11][CUDSS][1236892][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 0: 0 (must be -1 from index 0)
[2025-07-21 15:30:11][CUDSS][1236892][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 1: 0 (must be -1 from index 1024)
[2025-07-21 15:30:11][CUDSS][1236892][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 2: 0 (must be -1 from index 1536)
[2025-07-21 15:30:11][CUDSS][1236892][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 3: 0 (must be -1 from index 1792)
[2025-07-21 15:30:11][CUDSS][1236892][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 4: 0 (must be -1 from index 1920)
[2025-07-21 15:30:11][CUDSS][1236892][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 5: 0 (must be -1 from index 1984)
[2025-07-21 15:30:11][CUDSS][1236892][Error][cudssExecute] Distributed node bounds has a wrong right endpoint for lvl = 6: -2 (must be -1 from index 2031)
[2025-07-21 15:30:11][CUDSS][1236892][Error][cudssExecute] Distributed node bounds has a wrong right endpoint for lvl = 7: -2 (must be -1 from index 2039)
[2025-07-21 15:30:11][CUDSS][1236892][Error][cudssExecute] Distributed node bounds has a wrong right endpoint for lvl = 8: -2 (must be -1 from index 2043)
[2025-07-21 15:30:11][CUDSS][1236892][Error][cudssExecute] Distributed node bounds has a wrong right endpoint for lvl = 9: -2 (must be -1 from index 2045)
[2025-07-21 15:30:11][CUDSS][1236892][Error][cudssExecute] Elimination tree is not consistent
[2025-07-21 15:30:11][CUDSS][1236878][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 0: 0 (must be -1 from index 0)
[2025-07-21 15:30:11][CUDSS][1236878][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 1: 0 (must be -1 from index 1024)
[2025-07-21 15:30:11][CUDSS][1236884][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 0: 0 (must be -1 from index 0)
[2025-07-21 15:30:11][CUDSS][1236884][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 1: 0 (must be -1 from index 1024)
[2025-07-21 15:30:11][CUDSS][1236884][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 2: 0 (must be -1 from index 1536)
[2025-07-21 15:30:11][CUDSS][1236884][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 3: 0 (must be -1 from index 1792)
[2025-07-21 15:30:11][CUDSS][1236884][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 4: 0 (must be -1 from index 1920)
[2025-07-21 15:30:11][CUDSS][1236878][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 2: 0 (must be -1 from index 1536)
[2025-07-21 15:30:11][CUDSS][1236878][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 3: 0 (must be -1 from index 1792)
[2025-07-21 15:30:11][CUDSS][1236878][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 4: 0 (must be -1 from index 1920)
[2025-07-21 15:30:11][CUDSS][1236878][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 5: 0 (must be -1 from index 1984)
[2025-07-21 15:30:11][CUDSS][1236884][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 5: 0 (must be -1 from index 1984)
[2025-07-21 15:30:11][CUDSS][1236884][Error][cudssExecute] Distributed node bounds has a wrong right endpoint for lvl = 6: -2 (must be -1 from index 2031)
[2025-07-21 15:30:11][CUDSS][1236884][Error][cudssExecute] Distributed node bounds has a wrong right endpoint for lvl = 7: -2 (must be -1 from index 2039)
[2025-07-21 15:30:11][CUDSS][1236884][Error][cudssExecute] Distributed node bounds has a wrong right endpoint for lvl = 8: -2 (must be -1 from index 2043)
[2025-07-21 15:30:11][CUDSS][1236884][Error][cudssExecute] Distributed node bounds has a wrong right endpoint for lvl = 9: -2 (must be -1 from index 2045)
[2025-07-21 15:30:11][CUDSS][1236888][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 0: 0 (must be -1 from index 0)
[2025-07-21 15:30:11][CUDSS][1236888][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 1: 0 (must be -1 from index 1024)
[2025-07-21 15:30:11][CUDSS][1236884][Error][cudssExecute] Elimination tree is not consistent
[2025-07-21 15:30:11][CUDSS][1236878][Error][cudssExecute] Distributed node bounds has a wrong right endpoint for lvl = 6: -2 (must be -1 from index 2031)
[2025-07-21 15:30:11][CUDSS][1236878][Error][cudssExecute] Distributed node bounds has a wrong right endpoint for lvl = 7: -2 (must be -1 from index 2039)
[2025-07-21 15:30:11][CUDSS][1236878][Error][cudssExecute] Distributed node bounds has a wrong right endpoint for lvl = 8: -2 (must be -1 from index 2043)
[2025-07-21 15:30:11][CUDSS][1236878][Error][cudssExecute] Distributed node bounds has a wrong right endpoint for lvl = 9: -2 (must be -1 from index 2045)
[2025-07-21 15:30:11][CUDSS][1236878][Error][cudssExecute] Elimination tree is not consistent
[2025-07-21 15:30:11][CUDSS][1236888][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 2: 0 (must be -1 from index 1536)
[2025-07-21 15:30:11][CUDSS][1236888][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 3: 0 (must be -1 from index 1792)
[2025-07-21 15:30:11][CUDSS][1236888][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 4: 0 (must be -1 from index 1920)
[2025-07-21 15:30:11][CUDSS][1236888][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 5: 0 (must be -1 from index 1984)
[2025-07-21 15:30:11][CUDSS][1236888][Error][cudssExecute] Distributed node bounds has a wrong right endpoint for lvl = 6: -2 (must be -1 from index 2031)
[2025-07-21 15:30:11][CUDSS][1236888][Error][cudssExecute] Distributed node bounds has a wrong right endpoint for lvl = 7: -2 (must be -1 from index 2039)
[2025-07-21 15:30:11][CUDSS][1236880][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 0: 0 (must be -1 from index 0)
[2025-07-21 15:30:11][CUDSS][1236880][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 1: 0 (must be -1 from index 1024)
[2025-07-21 15:30:11][CUDSS][1236880][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 2: 0 (must be -1 from index 1536)
[2025-07-21 15:30:11][CUDSS][1236880][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 3: 0 (must be -1 from index 1792)
[2025-07-21 15:30:11][CUDSS][1236888][Error][cudssExecute] Distributed node bounds has a wrong right endpoint for lvl = 8: -2 (must be -1 from index 2043)
[2025-07-21 15:30:11][CUDSS][1236888][Error][cudssExecute] Distributed node bounds has a wrong right endpoint for lvl = 9: -2 (must be -1 from index 2045)
[2025-07-21 15:30:11][CUDSS][1236888][Error][cudssExecute] Elimination tree is not consistent
[2025-07-21 15:30:11][CUDSS][1236880][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 4: 0 (must be -1 from index 1920)
[2025-07-21 15:30:11][CUDSS][1236880][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 5: 0 (must be -1 from index 1984)
[2025-07-21 15:30:11][CUDSS][1236880][Error][cudssExecute] Distributed node bounds has a wrong right endpoint for lvl = 6: -2 (must be -1 from index 2031)
[2025-07-21 15:30:11][CUDSS][1236880][Error][cudssExecute] Distributed node bounds has a wrong right endpoint for lvl = 7: -2 (must be -1 from index 2039)
[2025-07-21 15:30:11][CUDSS][1236880][Error][cudssExecute] Distributed node bounds has a wrong right endpoint for lvl = 8: -2 (must be -1 from index 2043)
[2025-07-21 15:30:11][CUDSS][1236880][Error][cudssExecute] Distributed node bounds has a wrong right endpoint for lvl = 9: -2 (must be -1 from index 2045)
[2025-07-21 15:30:11][CUDSS][1236880][Error][cudssExecute] Elimination tree is not consistent
[2025-07-21 15:30:11][CUDSS][1236882][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 0: 0 (must be -1 from index 0)
[2025-07-21 15:30:11][CUDSS][1236882][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 1: 0 (must be -1 from index 1024)
[2025-07-21 15:30:11][CUDSS][1236882][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 2: 0 (must be -1 from index 1536)
[2025-07-21 15:30:11][CUDSS][1236882][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 3: 0 (must be -1 from index 1792)
[2025-07-21 15:30:11][CUDSS][1236886][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 0: 0 (must be -1 from index 0)
[2025-07-21 15:30:11][CUDSS][1236886][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 1: 0 (must be -1 from index 1024)
[2025-07-21 15:30:11][CUDSS][1236886][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 2: 0 (must be -1 from index 1536)
[2025-07-21 15:30:11][CUDSS][1236886][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 3: 0 (must be -1 from index 1792)
[2025-07-21 15:30:11][CUDSS][1236886][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 4: 0 (must be -1 from index 1920)
[2025-07-21 15:30:11][CUDSS][1236886][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 5: 0 (must be -1 from index 1984)
[2025-07-21 15:30:11][CUDSS][1236886][Error][cudssExecute] Distributed node bounds has a wrong right endpoint for lvl = 6: -2 (must be -1 from index 2031)
[2025-07-21 15:30:11][CUDSS][1236882][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 4: 0 (must be -1 from index 1920)
[2025-07-21 15:30:11][CUDSS][1236882][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 5: 0 (must be -1 from index 1984)
[2025-07-21 15:30:11][CUDSS][1236882][Error][cudssExecute] Distributed node bounds has a wrong right endpoint for lvl = 6: -2 (must be -1 from index 2031)
[2025-07-21 15:30:11][CUDSS][1236886][Error][cudssExecute] Distributed node bounds has a wrong right endpoint for lvl = 7: -2 (must be -1 from index 2039)
[2025-07-21 15:30:11][CUDSS][1236886][Error][cudssExecute] Distributed node bounds has a wrong right endpoint for lvl = 8: -2 (must be -1 from index 2043)
[2025-07-21 15:30:11][CUDSS][1236886][Error][cudssExecute] Distributed node bounds has a wrong right endpoint for lvl = 9: -2 (must be -1 from index 2045)
[2025-07-21 15:30:11][CUDSS][1236886][Error][cudssExecute] Elimination tree is not consistent
[2025-07-21 15:30:11][CUDSS][1236882][Error][cudssExecute] Distributed node bounds has a wrong right endpoint for lvl = 7: -2 (must be -1 from index 2039)
[2025-07-21 15:30:11][CUDSS][1236882][Error][cudssExecute] Distributed node bounds has a wrong right endpoint for lvl = 8: -2 (must be -1 from index 2043)
[2025-07-21 15:30:11][CUDSS][1236882][Error][cudssExecute] Distributed node bounds has a wrong right endpoint for lvl = 9: -2 (must be -1 from index 2045)
[2025-07-21 15:30:11][CUDSS][1236882][Error][cudssExecute] Elimination tree is not consistent
[2025-07-21 15:30:11][CUDSS][1236890][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 0: 0 (must be -1 from index 0)
[2025-07-21 15:30:11][CUDSS][1236890][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 1: 0 (must be -1 from index 1024)
[2025-07-21 15:30:11][CUDSS][1236890][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 2: 0 (must be -1 from index 1536)
[2025-07-21 15:30:11][CUDSS][1236890][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 3: 0 (must be -1 from index 1792)
[2025-07-21 15:30:11][CUDSS][1236890][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 4: 0 (must be -1 from index 1920)
[2025-07-21 15:30:11][CUDSS][1236890][Error][cudssExecute] Distributed node bounds has a wrong left endpoint for lvl = 5: 0 (must be -1 from index 1984)
[2025-07-21 15:30:11][CUDSS][1236890][Error][cudssExecute] Distributed node bounds has a wrong right endpoint for lvl = 6: -2 (must be -1 from index 2031)
[2025-07-21 15:30:11][CUDSS][1236890][Error][cudssExecute] Distributed node bounds has a wrong right endpoint for lvl = 7: -2 (must be -1 from index 2039)
[2025-07-21 15:30:11][CUDSS][1236890][Error][cudssExecute] Distributed node bounds has a wrong right endpoint for lvl = 8: -2 (must be -1 from index 2043)
[2025-07-21 15:30:11][CUDSS][1236890][Error][cudssExecute] Distributed node bounds has a wrong right endpoint for lvl = 9: -2 (must be -1 from index 2045)
[2025-07-21 15:30:11][CUDSS][1236890][Error][cudssExecute] Elimination tree is not consistent
Indeed the warnings from MPI look scary and might explain the behavior. I am slightly surprised that a problem didn’t manifest itself earlier, though.
There is a sequence of steps you can take to validate that the communication layer and communication backend work in your environment.
- First experiment: call one of the simple routines from the communication backend, like ncclCommUserRank() for NCCL or MPI_Comm_rank() for MPI.
- If the previous experiment works, cudaMalloc a buffer and call a collective on it, like ncclBcast() or MPI_Bcast().
- If the previous experiment works, the next step would be to check the communication layer (by using dlopen/dlsym and calling the same broadcast symbol, but through cudssBcast()).
I suspect that in your case the second experiment will fail. One day we will add these steps to our documentation for troubleshooting MGMN environment issues.
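For reference, here is a minimal sketch of the first two experiments with the MPI backend (a hypothetical standalone test, not taken from the cuDSS samples):
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Experiment 1: a simple backend routine */
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d\n", rank, size);

    /* Experiment 2: broadcast a device buffer (this is what requires CUDA-aware MPI) */
    cudaSetDevice(rank); /* assumes one GPU per rank on a single node; adjust as needed */
    int *d_buf = NULL;
    cudaMalloc((void **)&d_buf, 1024 * sizeof(int));
    MPI_Bcast(d_buf, 1024, MPI_INT, 0, MPI_COMM_WORLD);
    cudaDeviceSynchronize();
    if (rank == 0) printf("device broadcast completed\n");

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}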
One piece of advice about CUDA-aware MPI: you can get one by installing the NVIDIA HPC SDK (NVIDIA HPC SDK Current Release Downloads | NVIDIA Developer).
Thanks! It seems there were some problems with our own MPI installation. We were getting segfaults when we called MPI_Allreduce. After switching to the NVHPC SDK and its bundled OpenMPI implementation as you suggested, the errors were gone.
Unfortunately, even with single precision + 3x48 GiB GPUs + hybrid memory mode, we are still getting memory allocation errors for large samples (5e6 x 5e6 with 1e9 non-zeros). I know this is not the right place to ask since it is not about the original question, but do you have any suggestions we could try other than using more GPUs?
Great that using HPC SDK resolves the configuration issues!
Regarding the memory errors:
- Which version of cuDSS are you using? If cuDSS 0.6.0, are you using matching/scaling?
- Does the analysis succeed? If so, could you query the memory estimates via cudssDataGet() with CUDSS_DATA_MEMORY_ESTIMATES and share the output (on each process, keeping an eye on both host and device memory requirements)? Additionally, could you query the free device memory right before or after it via cudaMemGetInfo() and report how much free device memory each process sees? (A minimal sketch of both queries is shown after this list.)
- If even the analysis phase cannot succeed, I am afraid the only reasonable way forward is (if possible) for you to share the matrix with us so that we can check internally whether and how the memory consumption can be reduced.
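A minimal sketch of the memory queries from the second point above (the exact size and layout of the CUDSS_DATA_MEMORY_ESTIMATES array are described in the cuDSS documentation, so treat the printed entries here as illustrative; "handle", "data", and "rank" are placeholders for your own objects):
#include <stdio.h>
#include <stdint.h>
#include <cuda_runtime.h>
#include <cudss.h>

/* Hypothetical helper: print memory estimates after analysis plus free device memory. */
static void report_memory(cudssHandle_t handle, cudssData_t data, int rank)
{
    int64_t est[16] = {0};   /* CUDSS_DATA_MEMORY_ESTIMATES fills an int64_t array;
                                see the cuDSS docs for the meaning of each entry */
    size_t written = 0;
    cudssDataGet(handle, data, CUDSS_DATA_MEMORY_ESTIMATES,
                 est, sizeof(est), &written);

    size_t free_b = 0, total_b = 0;
    cudaMemGetInfo(&free_b, &total_b);

    printf("rank %d: first two estimate entries = %lld / %lld bytes, "
           "free device memory = %zu of %zu bytes\n",
           rank, (long long)est[0], (long long)est[1], free_b, total_b);
}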
One other possibility for running out of memory is if all of your MPI processes are using the same GPU; please check that this doesn't happen (it might depend on the environment settings).
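For that last point, a minimal sketch of one common way to bind each rank to its own GPU (a generic MPI-3 pattern, not something specific to cuDSS), to be called after MPI_Init() and before any CUDA or cuDSS calls:
#include <mpi.h>
#include <cuda_runtime.h>

/* Hypothetical helper: bind each MPI rank to a distinct GPU on its node. */
static void bind_rank_to_gpu(void)
{
    int world_rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Determine the rank within the node via a shared-memory sub-communicator. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, world_rank,
                        MPI_INFO_NULL, &node_comm);
    int local_rank = 0;
    MPI_Comm_rank(node_comm, &local_rank);
    MPI_Comm_free(&node_comm);

    int device_count = 0;
    cudaGetDeviceCount(&device_count);
    cudaSetDevice(local_rank % device_count);  /* one GPU per local rank */
}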
Thanks! Unfortunately, as the GPU queue is quite busy right now, I have not been able to rerun the program, but I am in the queue for tomorrow.
- I am using cuDSS 0.6.0 without matching/scaling
- I am not sure at the moment but will definitely check tomorrow. But since I got the error
Memory allocation failed with error = 2 for size = 43441738584
I guess the estimate would also be larger than the limit.
- As the matrix is created on the fly, I will save it in the next run. Would the .npy format be OK?
- There was high GPU utilization on all of the GPUs while the program was running, so it uses all of them.
At the time of the error, there was 20 GiB of unused space left on each of the non-root GPUs, so there was ~40 GiB of extra space. Would using distributed matrices utilize the unused space on the non-root GPUs?