Slurm fails with multiple processes: MPI_Init errors, PML add procs failed

Hi folks, I'm using the Slurm workload manager in Bright Cluster Manager 9.2 and have only edited slurm.conf for the GRES configuration. MPI_Init errors show up when running srun or sbatch with multiple processes; a single process passes.
FYI, I'm using enroot + pyxis for running containers with Slurm. The GRES changes were roughly as sketched below.
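For reference, the GRES-related lines I added look something like this (a minimal sketch; the GPU count, device paths and node definition here are placeholders, not my exact config):

# slurm.conf -- GRES-related lines (sketch; values are placeholders)
GresTypes=gpu
NodeName=node001 Gres=gpu:4 State=UNKNOWN   # Gres= appended to the existing node definition

# gres.conf on node001 (sketch)
NodeName=node001 Name=gpu File=/dev/nvidia[0-3]

Below is the trace from the failing multi-process run.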

[root@bright88 mxnet]# srun --kill-on-bad-exit=0 --mpi=pmix -G 4 --ntasks=4 -w node001 --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area bash ./run_and_time.sh

STARTING TIMING RUN AT 2022-10-19 08:19:08 PM
running benchmark
STARTING TIMING RUN AT 2022-10-19 08:19:08 PM
STARTING TIMING RUN AT 2022-10-19 08:19:08 PM
running benchmark
running benchmark
STARTING TIMING RUN AT 2022-10-19 08:19:08 PM
running benchmark
num_sockets = 2 num_nodes=2 cores_per_socket=32
+ exec numactl --physcpubind=48-63,112-127 --membind=1 -- python train.py --log-prefix 'run__{}_.log' --data-root-dir /data --num-epochs 5 --target-mae 0.124 --base-lr 0.004 --initial-lr 0.001 --momentum 0.9 --weight-decay 0.0 --warmup-epochs 0 --lr-scheduler-epochs 16 32 --lr-scheduler-decays 0.25 0.125 --training-batch-size 16 --validation-batch-size 16 --training-samples -1 --validation-samples -1 --data-layout NDHWC --data-shard-multiplier 1 --dali-num-threads 64 --shard-type local --seed 0 --grad-prediv-factor 1.0 --instances 1 --spatial-span 1 --load-checkpoint '' --save-checkpoint /results/checkpoint.data --apply-log-transform --shuffle --preshuffle --use-amp
num_sockets = 2 num_nodes=2 cores_per_socket=32
num_sockets = 2 num_nodes=2 cores_per_socket=32
num_sockets = 2 num_nodes=2 cores_per_socket=32
+ exec numactl --physcpubind=32-47,96-111 --membind=1 -- python train.py --log-prefix 'run__{}_.log' --data-root-dir /data --num-epochs 5 --target-mae 0.124 --base-lr 0.004 --initial-lr 0.001 --momentum 0.9 --weight-decay 0.0 --warmup-epochs 0 --lr-scheduler-epochs 16 32 --lr-scheduler-decays 0.25 0.125 --training-batch-size 16 --validation-batch-size 16 --training-samples -1 --validation-samples -1 --data-layout NDHWC --data-shard-multiplier 1 --dali-num-threads 64 --shard-type local --seed 0 --grad-prediv-factor 1.0 --instances 1 --spatial-span 1 --load-checkpoint '' --save-checkpoint /results/checkpoint.data --apply-log-transform --shuffle --preshuffle --use-amp
+ exec numactl --physcpubind=0-15,64-79 --membind=0 -- python train.py --log-prefix 'run__{}_.log' --data-root-dir /data --num-epochs 5 --target-mae 0.124 --base-lr 0.004 --initial-lr 0.001 --momentum 0.9 --weight-decay 0.0 --warmup-epochs 0 --lr-scheduler-epochs 16 32 --lr-scheduler-decays 0.25 0.125 --training-batch-size 16 --validation-batch-size 16 --training-samples -1 --validation-samples -1 --data-layout NDHWC --data-shard-multiplier 1 --dali-num-threads 64 --shard-type local --seed 0 --grad-prediv-factor 1.0 --instances 1 --spatial-span 1 --load-checkpoint '' --save-checkpoint /results/checkpoint.data --apply-log-transform --shuffle --preshuffle --use-amp
+ exec numactl --physcpubind=16-31,80-95 --membind=0 -- python train.py --log-prefix 'run__{}_.log' --data-root-dir /data --num-epochs 5 --target-mae 0.124 --base-lr 0.004 --initial-lr 0.001 --momentum 0.9 --weight-decay 0.0 --warmup-epochs 0 --lr-scheduler-epochs 16 32 --lr-scheduler-decays 0.25 0.125 --training-batch-size 16 --validation-batch-size 16 --training-samples -1 --validation-samples -1 --data-layout NDHWC --data-shard-multiplier 1 --dali-num-threads 64 --shard-type local --seed 0 --grad-prediv-factor 1.0 --instances 1 --spatial-span 1 --load-checkpoint '' --save-checkpoint /results/checkpoint.data --apply-log-transform --shuffle --preshuffle --use-amp
[node001:100528] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[node001:100525] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[node001:100532] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[node001:100527] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[1666178350.584090] [node001:100532:0]    rc_mlx5_devx.c:99   UCX  ERROR mlx5dv_devx_create_event_channel() failed: Protocol not supported
[node001:100532] pml_ucx.c:309  Error: Failed to create UCP worker
[node001:100525] pml_ucx.c:178  Error: Failed to receive UCX worker address: Not found (-13)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[node001:100525] *** An error occurred in MPI_Init_thread
[node001:100525] *** reported by process [142357014,3]
[node001:100525] *** on a NULL communicator
[node001:100525] *** Unknown error
[node001:100525] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node001:100525] ***    and potentially your MPI job)
[node001:100528] pml_ucx.c:178  Error: Failed to receive UCX worker address: Not found (-13)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[node001:100528] *** An error occurred in MPI_Init_thread
[node001:100528] *** reported by process [142357014,2]
[node001:100528] *** on a NULL communicator
[node001:100528] *** Unknown error
[node001:100528] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node001:100528] ***    and potentially your MPI job)
[node001:100527] pml_ucx.c:178  Error: Failed to receive UCX worker address: Not found (-13)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[node001:100527] *** An error occurred in MPI_Init_thread
[node001:100527] *** reported by process [142357014,1]
[node001:100527] *** on a NULL communicator
[node001:100527] *** Unknown error
[node001:100527] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node001:100527] ***    and potentially your MPI job)
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 213.0 ON node001 CANCELLED AT 2022-10-19T20:19:10 ***
srun: error: node001: tasks 0-3: Killed

The UCX and PMIx errors can be ignored by adding the relevant flags to the srun or sbatch invocation.
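For example, something along these lines (these are common workarounds I came across rather than an official fix, so treat the exact values as assumptions that may need adjusting):

# exported in the shell before srun, or at the top of the sbatch script
export PMIX_MCA_gds=hash      # avoid the gds_ds12 shared-memory lock path behind the PMIX ERROR lines
export OMPI_MCA_pml=^ucx      # fall back from the UCX PML; alternatively keep UCX and restrict transports, e.g. UCX_TLS=tcp,self,sm

# then run the same srun/sbatch command as before

With sbatch these exports can simply go at the top of the batch script; with srun they should be inherited from the submitting shell and passed through into the tasks (and the pyxis container).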

Below is the output from simply running the ls command on node001; it runs fine:

[root@bright88 ~]# srun --kill-on-bad-exit=0 --mpi=pmix -G 4 --ntasks=4 -w node001 ls /mnt/lustre
cosmoUniverse_2019_05_4parE_tf_small
cosmoUniverse_2019_05_4parE_tf_v2
cosmoUniverse_2019_05_4parE_tf_v2.tar
mxnet
processed
results
cosmoUniverse_2019_05_4parE_tf_small
cosmoUniverse_2019_05_4parE_tf_v2
cosmoUniverse_2019_05_4parE_tf_v2.tar
mxnet
processed
results
cosmoUniverse_2019_05_4parE_tf_small
cosmoUniverse_2019_05_4parE_tf_v2
cosmoUniverse_2019_05_4parE_tf_v2.tar
mxnet
processed
results
cosmoUniverse_2019_05_4parE_tf_small
cosmoUniverse_2019_05_4parE_tf_v2
cosmoUniverse_2019_05_4parE_tf_v2.tar
mxnet
processed
results

Hi.

While we appreciate the questions, I must stress that this forum is NOT a replacement for professional support.
We will answer your questions when we have spare cycles to do so, but you should strongly consider purchasing a commercial license to gain access to the support team.

kw

Thanks Ken, I understand, but I have no choice as of now. I am trying to build an HPC cluster POC around Bright so that I can make the case to my management for buying licenses, etc. Until then I just have to figure it out myself or hope for help here :)

Ah! That’s good to know.

In that case, the correct path forward is to obtain an evaluation license from the sales team. An eval license will give you access to the support team.

Would you like to send me your contact information via email (kwoods@nvidia.com)? I can then coordinate contact with sales.

Thanks,
kw

I have an Easy8 license that's being used right now. I tried getting the evaluation license, but nobody contacted me. I will send you the email. Appreciate it.

Thanks Ken, I got the requested support. Is the Bright website down today? I am trying to log in, but the page keeps timing out.