cusolverMp deadlock

Hello,

I’m running LU factorization (cusolverMpGetrf) with cusolverMp (both 0.4.0 and 0.4.1) on matrices of varying size.
It runs fine on a 20,000 x 20,000 single-precision matrix with a 2 x 2 process grid (four A100 GPUs), but it deadlocks at a larger size (~57,000 x 57,000).

Below are the log files produced with CAL_LOG_LEVEL=2 and CAL_LOG_FILE=log.cal.%d.

  1. Case with 20,000 x 20,000 matrix
==> log.cal.21219 <==
[2023-07-14 17:23:10][cal][21219][Trace][cal_bcast] UCC bcast
[2023-07-14 17:23:10][cal][21219][Trace][cal_bcast] UCC bcast
[2023-07-14 17:23:10][cal][21219][Trace][cal_bcast] UCC bcast
[2023-07-14 17:23:10][cal][21219][Trace][cal_bcast] UCC bcast
[2023-07-14 17:23:10][cal][21219][Trace][cal_bcast] UCC bcast

==> log.cal.21220 <==
[2023-07-14 17:23:10][cal][21220][Trace][cal_bcast] UCC bcast
[2023-07-14 17:23:10][cal][21220][Trace][cal_allgather] UCC allgather in-place
[2023-07-14 17:23:10][cal][21220][Trace][cal_allgather] UCC allgather out-of-place
[2023-07-14 17:23:10][cal][21220][Trace][cal_bcast] UCC bcast
[2023-07-14 17:23:10][cal][21220][Trace][cal_bcast] UCC bcast

==> log.cal.21221 <==
[2023-07-14 17:23:10][cal][21221][Trace][cal_bcast] UCC bcast
[2023-07-14 17:23:10][cal][21221][Trace][cal_bcast] UCC bcast
[2023-07-14 17:23:10][cal][21221][Trace][cal_bcast] UCC bcast
[2023-07-14 17:23:10][cal][21221][Trace][cal_bcast] UCC bcast
[2023-07-14 17:23:10][cal][21221][Trace][cal_bcast] UCC bcast

==> log.cal.21222 <==
[2023-07-14 17:23:10][cal][21222][Trace][cal_bcast] UCC bcast
[2023-07-14 17:23:10][cal][21222][Trace][cal_bcast] UCC bcast
[2023-07-14 17:23:10][cal][21222][Trace][cal_bcast] UCC bcast
[2023-07-14 17:23:10][cal][21222][Trace][cal_bcast] UCC bcast
[2023-07-14 17:23:10][cal][21222][Trace][cal_bcast] UCC bcast

  2. Case with 57,000 x 57,000 matrix
==> log.cal.21346 <==
[2023-07-14 17:27:01][cal][21346][Trace][cal_bcast] UCC bcast
[2023-07-14 17:27:01][cal][21346][Trace][cal_bcast] UCC bcast
[2023-07-14 17:27:01][cal][21346][Trace][cal_bcast] UCC bcast
[2023-07-14 17:27:01][cal][21346][Trace][cal_bcast] UCC bcast
[2023-07-14 17:27:01][cal][21346][Trace][cal_bcast] UCC bcast

==> log.cal.21347 <==
[2023-07-14 17:27:01][cal][21347][Trace][cal_bcast] UCC bcast
[2023-07-14 17:27:01][cal][21347][Trace][cal_bcast] UCC bcast
[2023-07-14 17:27:01][cal][21347][Trace][cal_bcast] UCC bcast
[2023-07-14 17:27:01][cal][21347][Trace][cal_bcast] UCC bcast
[2023-07-14 17:27:01][cal][21347][Trace][cal_bcast] UCC bcast

==> log.cal.21348 <==
[2023-07-14 17:27:01][cal][21348][Trace][cal_bcast] UCC bcast
[2023-07-14 17:27:01][cal][21348][Trace][cal_bcast] UCC bcast
[2023-07-14 17:27:01][cal][21348][Trace][cal_bcast] UCC bcast
[2023-07-14 17:27:01][cal][21348][Trace][cal_bcast] UCC bcast
[2023-07-14 17:27:01][cal][21348][Trace][cal_bcast] UCC bcast

==> log.cal.21349 <==
[2023-07-14 17:27:01][cal][21349][Trace][cal_allgather] UCC allgather in-place
[2023-07-14 17:27:01][cal][21349][Trace][cal_allgather] UCC allgather out-of-place
[2023-07-14 17:27:01][cal][21349][Trace][cal_bcast] UCC bcast
[2023-07-14 17:27:01][cal][21349][Trace][cal_bcast] UCC bcast
[2023-07-14 17:27:01][cal][21349][Trace][cal_allgather] UCC allgather in-place

I think the allgather call issued by the last process (21349) causes the deadlock, but I cannot find a workaround.
Any ideas?
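
One way to make this comparison less manual is to pull the collective name out of the last trace line of each rank and check whether they agree. The following is only a rough sketch: the helper names `bracket_field` and `last_ops_agree` are hypothetical, and the field position (5th bracketed field) is an assumption based on the log lines shown above. A mismatch is a hint rather than proof of a deadlock, since collectives on row or column sub-communicators can legitimately differ between ranks.

```c
#include <stddef.h>
#include <string.h>

/* Copy the contents of the k-th (1-based) "[...]" field of `line` into `out`.
 * Returns 0 on success, -1 if the field is missing or does not fit. */
int bracket_field(const char *line, int k, char *out, size_t outsz)
{
    const char *p = line;
    for (int i = 1; ; ++i) {
        const char *open = strchr(p, '[');
        if (!open) return -1;
        const char *close = strchr(open + 1, ']');
        if (!close) return -1;
        if (i == k) {
            size_t len = (size_t)(close - open - 1);
            if (len + 1 > outsz) return -1;
            memcpy(out, open + 1, len);
            out[len] = '\0';
            return 0;
        }
        p = close + 1;
    }
}

/* Returns 1 if the last trace line of every rank names the same collective
 * (assumed to be the 5th bracketed field, e.g. "cal_bcast"), otherwise 0. */
int last_ops_agree(const char *const last_lines[], int nranks)
{
    char first[64], cur[64];
    if (nranks <= 0) return 0;
    if (bracket_field(last_lines[0], 5, first, sizeof first) != 0) return 0;
    for (int r = 1; r < nranks; ++r) {
        if (bracket_field(last_lines[r], 5, cur, sizeof cur) != 0) return 0;
        if (strcmp(first, cur) != 0) return 0;
    }
    return 1;
}
```

Fed with the last line of each of the four 57,000 x 57,000 log files above, `last_ops_agree` returns 0, because process 21349 ends on `cal_allgather` while the others end on `cal_bcast`.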

Best regards,

Hi, could you give more info on the problem parameters, such as block size, IA, JA, and pivoting?

Hi!

This is the log with CUSOLVERMP_LOG_LEVEL=6. I hope you find it useful.
Additionally, regardless of the d_ipiv size (whether it is the row/column size of the global matrix, LOCr(M_A) + MB_A, LOCc(M_A), …), the program stops at cusolverMpGetrf.

[2023-07-17 14:55:30][cusolverMp][45960][Api][cusolverMpCreate] API=cusolverMpCreate, handle=0x7ffcff5a6220, deviceId=1, streamId=0x5870af0
[2023-07-17 14:55:30][cusolverMp][45961][Api][cusolverMpCreate] API=cusolverMpCreate, handle=0x7ffc8aeadb70, deviceId=2, streamId=0x5c7c5f0
[2023-07-17 14:55:30][cusolverMp][45962][Api][cusolverMpCreate] API=cusolverMpCreate, handle=0x7ffce1e02900, deviceId=3, streamId=0x52d60f0
[2023-07-17 14:55:30][cusolverMp][45959][Api][cusolverMpCreate] API=cusolverMpCreate, handle=0x7ffcce3bce40, deviceId=0, streamId=0x4695b00
[2023-07-17 14:55:31][cusolverMp][45962][Api][cusolverMpCreateDeviceGrid] API=cusolverMpCreateDeviceGrid, handle=0x53018b0, grid=0x7ffce1e02910, comm=0x3b01f30, numRowDevices=2, numColDevices=2, mapping=0
[2023-07-17 14:55:31][cusolverMp][45962][Api][cusolverMpCreateDeviceGrid] API=cusolverMpCreateDeviceGrid, handle=0x53018b0, grid=0x7ffce1e02918, comm=0x3b01f30, numRowDevices=2, numColDevices=2, mapping=0
[2023-07-17 14:55:31][cusolverMp][45962][Api][cusolverMpCreateMatrixDesc] API=cusolverMpCreateMatrixDesc, descr=0x7ffce1e02a80, grid=0x3f1d6890, dataType=0, M_A=57219, N_A=57219, MB_A=1024, NB_A=1024, RSRC_A=0, CSRC_A=0, LLD_A=28672
[2023-07-17 14:55:31][cusolverMp][45962][Api][cusolverMpCreateMatrixDesc] API=cusolverMpCreateMatrixDesc, descr=0x7ffce1e02a88, grid=0x3f1d68b0, dataType=0, M_A=57219, N_A=1, MB_A=1024, NB_A=1024, RSRC_A=0, CSRC_A=0, LLD_A=28672
[2023-07-17 14:55:31][cusolverMp][45960][Api][cusolverMpCreateDeviceGrid] API=cusolverMpCreateDeviceGrid, handle=0x5c2f0d0, grid=0x7ffcff5a6230, comm=0x4397eb0, numRowDevices=2, numColDevices=2, mapping=0
[2023-07-17 14:55:31][cusolverMp][45960][Api][cusolverMpCreateDeviceGrid] API=cusolverMpCreateDeviceGrid, handle=0x5c2f0d0, grid=0x7ffcff5a6238, comm=0x4397eb0, numRowDevices=2, numColDevices=2, mapping=0
[2023-07-17 14:55:31][cusolverMp][45960][Api][cusolverMpCreateMatrixDesc] API=cusolverMpCreateMatrixDesc, descr=0x7ffcff5a63a0, grid=0x3fa73fb0, dataType=0, M_A=57219, N_A=57219, MB_A=1024, NB_A=1024, RSRC_A=0, CSRC_A=0, LLD_A=28672
[2023-07-17 14:55:31][cusolverMp][45960][Api][cusolverMpCreateMatrixDesc] API=cusolverMpCreateMatrixDesc, descr=0x7ffcff5a63a8, grid=0x3fa73fd0, dataType=0, M_A=57219, N_A=1, MB_A=1024, NB_A=1024, RSRC_A=0, CSRC_A=0, LLD_A=28672
[2023-07-17 14:55:31][cusolverMp][45959][Api][cusolverMpCreateDeviceGrid] API=cusolverMpCreateDeviceGrid, handle=0x4691920, grid=0x7ffcce3bce50, comm=0x207dfc0, numRowDevices=2, numColDevices=2, mapping=0
[2023-07-17 14:55:31][cusolverMp][45959][Api][cusolverMpCreateDeviceGrid] API=cusolverMpCreateDeviceGrid, handle=0x4691920, grid=0x7ffcce3bce58, comm=0x207dfc0, numRowDevices=2, numColDevices=2, mapping=0
[2023-07-17 14:55:31][cusolverMp][45959][Api][cusolverMpCreateMatrixDesc] API=cusolverMpCreateMatrixDesc, descr=0x7ffcce3bcfc0, grid=0x3ed8c2d0, dataType=0, M_A=57219, N_A=57219, MB_A=1024, NB_A=1024, RSRC_A=0, CSRC_A=0, LLD_A=28672
[2023-07-17 14:55:31][cusolverMp][45959][Api][cusolverMpCreateMatrixDesc] API=cusolverMpCreateMatrixDesc, descr=0x7ffcce3bcfc8, grid=0x3ed8c2f0, dataType=0, M_A=57219, N_A=1, MB_A=1024, NB_A=1024, RSRC_A=0, CSRC_A=0, LLD_A=28672
[2023-07-17 14:55:31][cusolverMp][45961][Api][cusolverMpCreateDeviceGrid] API=cusolverMpCreateDeviceGrid, handle=0x5c78410, grid=0x7ffc8aeadb80, comm=0x495da00, numRowDevices=2, numColDevices=2, mapping=0
[2023-07-17 14:55:31][cusolverMp][45961][Api][cusolverMpCreateDeviceGrid] API=cusolverMpCreateDeviceGrid, handle=0x5c78410, grid=0x7ffc8aeadb88, comm=0x495da00, numRowDevices=2, numColDevices=2, mapping=0
[2023-07-17 14:55:31][cusolverMp][45961][Api][cusolverMpCreateMatrixDesc] API=cusolverMpCreateMatrixDesc, descr=0x7ffc8aeadcf0, grid=0x40033400, dataType=0, M_A=57219, N_A=57219, MB_A=1024, NB_A=1024, RSRC_A=0, CSRC_A=0, LLD_A=28672
[2023-07-17 14:55:31][cusolverMp][45961][Api][cusolverMpCreateMatrixDesc] API=cusolverMpCreateMatrixDesc, descr=0x7ffc8aeadcf8, grid=0x40033420, dataType=0, M_A=57219, N_A=1, MB_A=1024, NB_A=1024, RSRC_A=0, CSRC_A=0, LLD_A=28672

[2023-07-17 14:56:19][cusolverMp][45961][Api][cusolverMpGetrf_bufferSize] handle=0x5c78410, M=57219, N=57219, d_A=0x7f38e4000000, IA=1, JA=1, descrA=0x40035890, d_ipiv=0x7f3a7a6a1400, computeType=0, workspaceInBytesOnDevice=0x7ffc8aead940, workspaceInBytesOnHost=0x7ffc8aead948
[2023-07-17 14:56:19][cusolverMp][45961][Info][cusolverMpGetrf_bufferSize] Query workspace for pivoting getrf
[2023-07-17 14:56:19][cusolverMp][45959][Api][cusolverMpGetrf_bufferSize] handle=0x4691920, M=57219, N=57219, d_A=0x7f5122000000, IA=1, JA=1, descrA=0x3ed8e760, d_ipiv=0x7f52b66a1400, computeType=0, workspaceInBytesOnDevice=0x7ffcce3bcc10, workspaceInBytesOnHost=0x7ffcce3bcc18
[2023-07-17 14:56:19][cusolverMp][45959][Info][cusolverMpGetrf_bufferSize] Query workspace for pivoting getrf
[2023-07-17 14:56:19][cusolverMp][45962][Api][cusolverMpGetrf_bufferSize] handle=0x53018b0, M=57219, N=57219, d_A=0x7f2a94000000, IA=1, JA=1, descrA=0x3f1d8d20, d_ipiv=0x7f2c2a6a1400, computeType=0, workspaceInBytesOnDevice=0x7ffce1e026d0, workspaceInBytesOnHost=0x7ffce1e026d8
[2023-07-17 14:56:19][cusolverMp][45962][Info][cusolverMpGetrf_bufferSize] Query workspace for pivoting getrf
[2023-07-17 14:56:19][cusolverMp][45960][Api][cusolverMpGetrf_bufferSize] handle=0x5c2f0d0, M=57219, N=57219, d_A=0x7f3b16000000, IA=1, JA=1, descrA=0x3fa76440, d_ipiv=0x7f3caa6a1400, computeType=0, workspaceInBytesOnDevice=0x7ffcff5a5ff0, workspaceInBytesOnHost=0x7ffcff5a5ff8
[2023-07-17 14:56:19][cusolverMp][45960][Info][cusolverMpGetrf_bufferSize] Query workspace for pivoting getrf
[2023-07-17 14:56:19][cusolverMp][45962][Api][cusolverMpGetrf] handle=0x53018b0, M=57219, N=57219, d_A=0x7f2a94000000, IA=1, JA=1, descrA=0x3f1d8d20, d_ipiv=0x7f2c2a6a1400, computeType=0, d_work=0x7f2b58000000, workspaceInBytesOnDevice=305797056, h_work=0x4b0d32e0, workspaceInBytesOnHost=5056064, d_info=0x7f2c207f7c00
[2023-07-17 14:56:19][cusolverMp][45962][Info][cusolverMpGetrf] use pivoting algorithm
[2023-07-17 14:56:19][cusolverMp][45961][Api][cusolverMpGetrf] handle=0x5c78410, M=57219, N=57219, d_A=0x7f38e4000000, IA=1, JA=1, descrA=0x40035890, d_ipiv=0x7f3a7a6a1400, computeType=0, d_work=0x7f39a8000000, workspaceInBytesOnDevice=305797056, h_work=0x4bcdc720, workspaceInBytesOnHost=5056064, d_info=0x7f3a707f7c00
[2023-07-17 14:56:19][cusolverMp][45961][Info][cusolverMpGetrf] use pivoting algorithm
[2023-07-17 14:56:19][cusolverMp][45959][Api][cusolverMpGetrf] handle=0x4691920, M=57219, N=57219, d_A=0x7f5122000000, IA=1, JA=1, descrA=0x3ed8e760, d_ipiv=0x7f52b66a1400, computeType=0, d_work=0x7f5207c00000, workspaceInBytesOnDevice=305797056, h_work=0x4ab36720, workspaceInBytesOnHost=5056064, d_info=0x7f52abdf9600
[2023-07-17 14:56:19][cusolverMp][45959][Info][cusolverMpGetrf] use pivoting algorithm
[2023-07-17 14:56:19][cusolverMp][45960][Api][cusolverMpGetrf] handle=0x5c2f0d0, M=57219, N=57219, d_A=0x7f3b16000000, IA=1, JA=1, descrA=0x3fa76440, d_ipiv=0x7f3caa6a1400, computeType=0, d_work=0x7f3bda800000, workspaceInBytesOnDevice=305797056, h_work=0x4b81e9c0, workspaceInBytesOnHost=5056064, d_info=0x7f3c9fdf9600
[2023-07-17 14:56:19][cusolverMp][45960][Info][cusolverMpGetrf] use pivoting algorithm

I checked the norm (cublasnrm2) of each row of the matrix, and it seems that the routine (specifically, the pivoting) stops when a row has zero norm, |A_i| = 0. An ad-hoc fix of setting one of the values in A_i to a non-zero value worked.
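
For anyone who wants to run the same check on the host before calling the factorization, here is a minimal sketch. It assumes a single-precision, column-major local block with leading dimension `lda`; the function name `first_zero_row` is hypothetical and not part of cusolverMp or cuBLAS. Note that in a distributed layout a global row is zero only if it is zero on every process column that holds part of it, so the per-rank results would still need to be combined across ranks.

```c
#include <stddef.h>

/* Scan a column-major m x n block (leading dimension lda >= m) and return
 * the index of the first row whose entries are all exactly zero,
 * or -1 if every row has at least one non-zero entry. */
long first_zero_row(const float *a, size_t m, size_t n, size_t lda)
{
    for (size_t i = 0; i < m; ++i) {
        int all_zero = 1;
        for (size_t j = 0; j < n; ++j) {
            if (a[i + j * lda] != 0.0f) {  /* element (i, j) */
                all_zero = 0;
                break;
            }
        }
        if (all_zero) return (long)i;
    }
    return -1;
}
```

An all-zero row makes the matrix singular, so finding one before the factorization is a useful sanity check regardless of how the library reacts to it.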

Hi, there’s likely something wrong with the error handling when a rank returns early. We will look into it. Thanks for your input.

Thank you.
Besides, what is the proper size of the pivot array, d_ipiv?
In the cusolverMp manual it is LOCr(M_A) + MB_A, but it appears to be LOCc(M_A) in the mp_getrf_getrs.c example in CUDALibrarySamples.

It should be LOCc(N_A). We will fix the manual.
Good catch. Thanks again.
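
For reference, the LOCr/LOCc quantities mentioned in this thread follow the ScaLAPACK NUMROC convention for a block-cyclic distribution. The sketch below reimplements that rule in plain C so the numbers can be checked by hand. The name `numroc_sketch` is hypothetical, and this is an illustration, not the library's own helper.

```c
/* Number of rows (or columns) of a block-cyclically distributed dimension
 * that land on process `iproc`, following the ScaLAPACK NUMROC definition.
 *   n        : global extent (e.g. M_A or N_A)
 *   nb       : block size (e.g. MB_A or NB_A)
 *   iproc    : this process's row or column coordinate in the grid
 *   isrcproc : coordinate of the process owning the first block (RSRC_A/CSRC_A)
 *   nprocs   : number of process rows or columns */
long long numroc_sketch(long long n, long long nb,
                        int iproc, int isrcproc, int nprocs)
{
    int mydist = (nprocs + iproc - isrcproc) % nprocs; /* distance from source */
    long long nblocks = n / nb;                        /* full blocks overall */
    long long count = (nblocks / nprocs) * nb;         /* evenly shared blocks */
    long long extrablks = nblocks % nprocs;            /* leftover full blocks */

    if (mydist < extrablks)
        count += nb;       /* this process gets one more full block */
    else if (mydist == extrablks)
        count += n % nb;   /* this process gets the trailing partial block */
    return count;
}
```

With the values from the log (M_A=57219, MB_A=1024, RSRC_A=0, two process rows), this gives 28672 for process row 0, which matches the LLD_A reported above, and 28547 for process row 1.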