Error in cusolverMp syevd + hanging

Hi there,

Trying to get cuSolverMp to work for solving eigensystems. I previously got the examples to work using the HPC SDK, but I am now trying to integrate it into our codebase. Due to the way we already handle third-party libraries we don't/can't install the HPC SDK, so I am using cusolverMp and CAL from their standalone distributions, UCX and UCC from the HPC-X package, and cuBLAS/cuSOLVER from CUDA SDK 12.2. Furthermore, we use Intel MPI instead of NVIDIA's OpenMPI fork, so I have modified the CAL setup to work with Intel's MPICH-like interface. Compilation is done with GCC 12.3.0.

When I run a simple standard cusolverMpSyevd it works in serial with one GPU, but when I run with two MPI processes on a single machine with two NVIDIA A100-SXM4 GPUs (Driver Version: 560.35.03, CUDA Version: 12.6) it hangs. I set CUSOLVER_MP_LOG_LEVEL=5, CAL_LOG_LEVEL=6, UCC_LOG_LEVEL=error and UCX_LOG_LEVEL=error and get a bunch of info/error output, the last of which is:

[2024-11-12 15:45:31][cal][894691][Trace][cal_comm_split] UCC allgather in-place
[2024-11-12 15:45:31][cal][894691][Api][cal_comm_get_rank] comm=[de02tcadgpu1 [0]:0] rank=0x15544168f0bc
[2024-11-12 15:45:31][cal][894691][Api][cal_comm_get_size] comm=[de02tcadgpu1 [0]:0] size=0x15544168f0c0
[2024-11-12 15:45:31][cal][894690][Api][cal_comm_get_rank] comm=[de02tcadgpu1 [0]:1] rank=0x1554263b30bc
[2024-11-12 15:45:31][cal][894690][Api][cal_comm_get_size] comm=[de02tcadgpu1 [0]:1] size=0x1554263b30c0
[2024-11-12 15:45:31][cusolverMp][894690][Error][cusolverMpSyevd] CUDA failed with status: invalid argument, file: /home/jenkins/agent/workspace/cusolvermp/helpers/master/L0_MergeRequest/build/src/mp_sytrd.hxx, line: 175
[2024-11-12 15:45:31][cusolverMp][894690][Error][cusolverMpSyevd] cuSOLVER failed with status: 7, file: /home/jenkins/agent/workspace/cusolvermp/helpers/master/L0_MergeRequest/build/src/mp_syevd.hxx, line: 338
[2024-11-12 15:45:31][cal][894691][Api][cal_recv] comm=[de02tcadgpu1 [0]:0] count=256 type=CUDA_R_64F data=0x155410062df0 src_rank=1 tag=77 stream=0x15542007e6e0
[2024-11-12 15:45:31][cal][894691][Trace][cal_recv] ucc_transport::recv() 0 <- 1, 2048 bytes, tag: 77

The matrix distribution is correct and valid, and ELPA finds the correct eigenvectors using both its CPU and GPU kernels; I have also double-checked all the other preconditions stated in the documentation. As stated above, it works with one process and one GPU, but in that case it just redirects to the standard cusolver.

My hunch is that it has something to do with the libraries in use, or with the environment or configuration of UCX/UCC, since the example doing the same thing works with the libraries and environment from the HPC SDK. But there is so little information and documentation that it's almost impossible to figure out what goes wrong.

Solved it! The problem was that I was using a stream that had been created before calling cudaSetDevice(). The stream therefore belonged to the default device, which is not the device used in the subsequent cusolverMp calls, where each GPU is assigned to a unique process on each host. The fix was simply to select the device for the rank first and only then create the stream.
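For anyone hitting the same thing, here is a minimal sketch (not my actual code) of the ordering that matters. The node-local rank via MPI_Comm_split_type is just one way to pick the device, and the cusolverMp/CAL setup itself is elided:

```cpp
// Sketch: create the stream only after cudaSetDevice(), so that the stream
// passed to cusolverMp lives on the same device as its handles and buffers.
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

#define CUDA_CHECK(call)                                                     \
    do {                                                                      \
        cudaError_t err_ = (call);                                            \
        if (err_ != cudaSuccess) {                                            \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",                  \
                         cudaGetErrorString(err_), __FILE__, __LINE__);       \
            MPI_Abort(MPI_COMM_WORLD, 1);                                     \
        }                                                                     \
    } while (0)

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // One GPU per process on each host: derive a node-local rank and use it
    // as the device index (one of several ways to do this).
    MPI_Comm local_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                        MPI_INFO_NULL, &local_comm);
    int local_rank = 0;
    MPI_Comm_rank(local_comm, &local_rank);

    int device_count = 0;
    CUDA_CHECK(cudaGetDeviceCount(&device_count));
    int device = local_rank % device_count;

    // WRONG (what I had): creating the stream here, before cudaSetDevice(),
    // binds it to the default device (device 0) on every rank, so cusolverMp
    // later gets a stream whose device does not match its handles/buffers.
    //
    //   cudaStream_t stream;
    //   CUDA_CHECK(cudaStreamCreate(&stream));   // bound to the default device!
    //   CUDA_CHECK(cudaSetDevice(device));

    // RIGHT: select the device first, then create the stream on it. The same
    // stream is then passed on to the CAL communicator / cusolverMp handle
    // creation and to cusolverMpSyevd (omitted here).
    CUDA_CHECK(cudaSetDevice(device));
    cudaStream_t stream;
    CUDA_CHECK(cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking));

    // ... CAL communicator, cusolverMp handle, grids, descriptors, syevd ...

    CUDA_CHECK(cudaStreamDestroy(stream));
    MPI_Comm_free(&local_comm);
    MPI_Finalize();
    return 0;
}
```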

Simple mistake, but hard to debug; a more informative error message would have been nice.