Error in cusolverMp syevd + hanging

Hi there,

Trying to get cuSolverMp to work for solving eigensystems. I previously got the examples to work using the HPC SDK, but I am now trying to integrate it into our codebase. Due to the way we already handle third-party libraries we don't/can't install the HPC SDK, so I am using cusolverMp and CAL from their standalone distributions, UCX and UCC from the HPC-X package, and cuBLAS/cuSOLVER from CUDA SDK 12.2. Furthermore, we use Intel MPI instead of NVIDIA's OpenMPI fork, so I have modified the CAL setup to work with Intel's MPICH-like interface. Compilation is done with GCC 12.3.0.

When I run a simple standard cusolverMpSyevd it works in serial with one GPU, but when I run with two MPI processes on a single machine with two NVIDIA A100-SXM4 GPUs (Driver Version: 560.35.03, CUDA Version: 12.6) it hangs. I set CUSOLVER_MP_LOG_LEVEL=5, CAL_LOG_LEVEL=6, UCC_LOG_LEVEL=error and UCX_LOG_LEVEL=error and get a bunch of info/error output, the last of which is:

[2024-11-12 15:45:31][cal][894691][Trace][cal_comm_split] UCC allgather in-place
[2024-11-12 15:45:31][cal][894691][Api][cal_comm_get_rank] comm=[de02tcadgpu1 [0]:0] rank=0x15544168f0bc
[2024-11-12 15:45:31][cal][894691][Api][cal_comm_get_size] comm=[de02tcadgpu1 [0]:0] size=0x15544168f0c0
[2024-11-12 15:45:31][cal][894690][Api][cal_comm_get_rank] comm=[de02tcadgpu1 [0]:1] rank=0x1554263b30bc
[2024-11-12 15:45:31][cal][894690][Api][cal_comm_get_size] comm=[de02tcadgpu1 [0]:1] size=0x1554263b30c0
[2024-11-12 15:45:31][cusolverMp][894690][Error][cusolverMpSyevd] CUDA failed with status: invalid argument, file: /home/jenkins/agent/workspace/cusolvermp/helpers/master/L0_MergeRequest/build/src/mp_sytrd.hxx, line: 175
[2024-11-12 15:45:31][cusolverMp][894690][Error][cusolverMpSyevd] cuSOLVER failed with status: 7, file: /home/jenkins/agent/workspace/cusolvermp/helpers/master/L0_MergeRequest/build/src/mp_syevd.hxx, line: 338
[2024-11-12 15:45:31][cal][894691][Api][cal_recv] comm=[de02tcadgpu1 [0]:0] count=256 type=CUDA_R_64F data=0x155410062df0 src_rank=1 tag=77 stream=0x15542007e6e0
[2024-11-12 15:45:31][cal][894691][Trace][cal_recv] ucc_transport::recv() 0 <- 1, 2048 bytes, tag: 77

The matrix distribution is correct and valid, and ELPA finds the correct eigenvectors using both its CPU and GPU kernels; I have also double-checked all the other preconditions stated in the documentation. As stated above, it works with one process and one GPU, but in that case it just redirects to the standard cusolver.

My hunch is that it has something to do with the libraries in use, or with the environment or configuration of UCX/UCC, since the example doing the same thing works with the libraries and environment from the HPC SDK. But there is so little information and documentation that it's almost impossible to figure out what goes wrong.

Solved it! The problem was that I was using a stream that had been created before calling cudaSetDevice(). The stream therefore belonged to the default device, which is not the device used in the subsequent cusolverMp calls, where each GPU is assigned to a unique process on each host. The fix was simply to select the device for the rank first and only then create the stream.
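For anyone hitting the same thing, here is a minimal sketch (not my actual code) of the ordering that matters. The node-local rank via MPI_Comm_split_type is just one way to pick the device, and the cusolverMp/CAL setup itself is elided:

```cpp
// Sketch: create the stream only after cudaSetDevice(), so that the stream
// passed to cusolverMp lives on the same device as its handles and buffers.
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

#define CUDA_CHECK(call)                                                     \
    do {                                                                      \
        cudaError_t err_ = (call);                                            \
        if (err_ != cudaSuccess) {                                            \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",                  \
                         cudaGetErrorString(err_), __FILE__, __LINE__);       \
            MPI_Abort(MPI_COMM_WORLD, 1);                                     \
        }                                                                     \
    } while (0)

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // One GPU per process on each host: derive a node-local rank and use it
    // as the device index (one of several ways to do this).
    MPI_Comm local_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                        MPI_INFO_NULL, &local_comm);
    int local_rank = 0;
    MPI_Comm_rank(local_comm, &local_rank);

    int device_count = 0;
    CUDA_CHECK(cudaGetDeviceCount(&device_count));
    int device = local_rank % device_count;

    // WRONG (what I had): creating the stream here, before cudaSetDevice(),
    // binds it to the default device (device 0) on every rank, so cusolverMp
    // later gets a stream whose device does not match its handles/buffers.
    //
    //   cudaStream_t stream;
    //   CUDA_CHECK(cudaStreamCreate(&stream));   // bound to the default device!
    //   CUDA_CHECK(cudaSetDevice(device));

    // RIGHT: select the device first, then create the stream on it. The same
    // stream is then passed on to the CAL communicator / cusolverMp handle
    // creation and to cusolverMpSyevd (omitted here).
    CUDA_CHECK(cudaSetDevice(device));
    cudaStream_t stream;
    CUDA_CHECK(cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking));

    // ... CAL communicator, cusolverMp handle, grids, descriptors, syevd ...

    CUDA_CHECK(cudaStreamDestroy(stream));
    MPI_Comm_free(&local_comm);
    MPI_Finalize();
    return 0;
}
```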

Simple mistake, but hard to debug; a more informative error message would have been nice.