Command line:
srun -N 2 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix --container-image="${CONT}" ./hpl.sh --dat /workspace/hpl-linux-x86_64/sample-dat/HPL-H200-16GPUs.dat

Output:
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.8/main_nvshmem/src/modules/transport/common/transport_ib_common.cpp:97: NULL value mem registration failed. Reason: Bad address
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.8/main_nvshmem/src/modules/transport/ibrc/ibrc.cpp:498: non-zero status: 2 Unable to register memory handle.
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.8/main_nvshmem/src/host/mem/mem_heap.cpp:931: non-zero status: 7 register_mem_handle failed for remote
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.8/main_nvshmem/src/host/mem/mem_heap.cpp:1099: non-zero status: 7 register heap memory failed
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.8/main_nvshmem/src/host/mem/mem_heap.cpp:1534: non-zero status: 7 register heap UC memory failed
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.8/main_nvshmem/src/host/mem/mem_heap.cpp:533: non-zero status: 1 cuMemAddressFree failed
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.8/main_nvshmem/src/host/mem/mem_heap.cpp:1591: non-zero status: 7 allocate_physical_memory_to_heap failed
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.8/main_nvshmem/src/host/proxy/proxy.cpp:130: NULL value failed allocating proxy_channel_g_buf
channel creation failed
srun: error: slurm-compute-node-1: task 11: Exited with exit code 255
slurmstepd: error: mpi/pmix_v5: _errhandler: slurm-compute-node-1 [1]: pmixp_client_v2.c:211: Error handler invoked: status = -61, source = [slurm.pmix.128.0:11]
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.