MPI error while running HPL

I am running HPL via enroot on an Ubuntu 20.04 system with two PCIe A100 40GB GPUs, one EPYC 7413 24-core CPU, and 128 GB of RAM:

enroot start nv-hpl-bench

mpirun --mca btl smcuda,self -x UCX_TLS=sm,cuda,cuda_copy,cuda_ipc -np 2 hpl.sh --dat ./HPL.dat --cpu-affinity 0:0 --cpu-cores-per-rank 4 --gpu-affinity 0:1

I set MELLANOX_VISIBLE_DEVICES="none".

The test ends with an MPI error:

[esc4k:03485] *** An error occurred in MPI_Wait
[esc4k:03485] *** reported by process [1684144129,1]
[esc4k:03485] *** on communicator MPI COMMUNICATOR 6 SPLIT FROM 4
[esc4k:03485] *** MPI_ERR_TRUNCATE: message truncated
[esc4k:03485] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[esc4k:03485] *** and potentially your MPI job)
oem@esc4k:/workspace$
Message from syslogd@esc4k at Jul 21 00:06:40 …
kernel:[ 3238.085843] watchdog: BUG: soft lockup - CPU#11 stuck for 22s! [cuda-EvtHandlr:3498]

After that, the xhpl processes are left in zombie state in the process list, and the PC hangs.

HPL.dat:

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
50000 Ns
1 # of NBs
144 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
2 Ps
1 Qs
16.0 threshold
1 # of panel fact
0 1 2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
2 8 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
0 1 2 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
3 2 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 0 DEPTHs (>=0)
1 SWAP (0=bin-exch,1=long,2=mix)
192 swapping threshold
1 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
0 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
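
As a rough sanity check of the values above, here is a small sketch that compares the process grid with the mpirun rank count and estimates the matrix size. The numbers are copied from the HPL.dat and command line in this post. The even per-rank split and treating "40GB" as 40 GiB are simplifying assumptions, and the variable names are only illustrative.

```python
# Values copied from the HPL.dat and mpirun line in this post.
N = 50000          # Ns: problem size
P, Q = 2, 1        # Ps, Qs: process grid
NP = 2             # mpirun -np
GPU_MEM_GIB = 40   # A100 40GB, treated as 40 GiB for a rough check (assumption)

BYTES_PER_DOUBLE = 8


def matrix_gib(n):
    """Size of an n x n double-precision matrix in GiB."""
    return n * n * BYTES_PER_DOUBLE / 2**30


ranks = P * Q
total_gib = matrix_gib(N)
# Assumption: the block-cyclic layout spreads the matrix roughly evenly.
per_rank_gib = total_gib / ranks

print(f"P*Q = {ranks}, mpirun -np = {NP}, match = {ranks == NP}")
print(f"matrix: {total_gib:.1f} GiB total, about {per_rank_gib:.1f} GiB per rank")
print(f"fits in one GPU: {per_rank_gib < GPU_MEM_GIB}")
```

By this estimate the grid matches -np 2 and the matrix is well below the memory of each GPU, so the problem size alone does not seem to explain the failure.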

Any help would be appreciated.