Slurm fails with multiple processes: MPI_Init errors, PML add procs failed

Hi folks, I'm using the Slurm workload manager in Bright Cluster Manager 9.2 and have only edited slurm.conf for the GRES configuration. MPI_Init errors show up when running srun or sbatch with multiple processes; a single process passes.
FYI, I'm using enroot + pyxis for running containers with Slurm. The GRES changes were roughly as sketched below.
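For reference, the GRES-related lines I added look something like this (a minimal sketch; the GPU count, device paths and node definition here are placeholders, not my exact config):

# slurm.conf -- GRES-related lines (sketch; values are placeholders)
GresTypes=gpu
NodeName=node001 Gres=gpu:4 State=UNKNOWN   # Gres= appended to the existing node definition

# gres.conf on node001 (sketch)
NodeName=node001 Name=gpu File=/dev/nvidia[0-3]

Below is the trace from the failing multi-process run.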

[root@bright88 mxnet]# srun --kill-on-bad-exit=0 --mpi=pmix -G 4 --ntasks=4 -w node001 --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area bash ./run_and_time.sh

STARTING TIMING RUN AT 2022-10-19 08:19:08 PM
running benchmark
STARTING TIMING RUN AT 2022-10-19 08:19:08 PM
STARTING TIMING RUN AT 2022-10-19 08:19:08 PM
running benchmark
running benchmark
STARTING TIMING RUN AT 2022-10-19 08:19:08 PM
running benchmark
num_sockets = 2 num_nodes=2 cores_per_socket=32
+ exec numactl --physcpubind=48-63,112-127 --membind=1 -- python train.py --log-prefix 'run__{}_.log' --data-root-dir /data --num-epochs 5 --target-mae 0.124 --base-lr 0.004 --initial-lr 0.001 --momentum 0.9 --weight-decay 0.0 --warmup-epochs 0 --lr-scheduler-epochs 16 32 --lr-scheduler-decays 0.25 0.125 --training-batch-size 16 --validation-batch-size 16 --training-samples -1 --validation-samples -1 --data-layout NDHWC --data-shard-multiplier 1 --dali-num-threads 64 --shard-type local --seed 0 --grad-prediv-factor 1.0 --instances 1 --spatial-span 1 --load-checkpoint '' --save-checkpoint /results/checkpoint.data --apply-log-transform --shuffle --preshuffle --use-amp
num_sockets = 2 num_nodes=2 cores_per_socket=32
num_sockets = 2 num_nodes=2 cores_per_socket=32
num_sockets = 2 num_nodes=2 cores_per_socket=32
+ exec numactl --physcpubind=32-47,96-111 --membind=1 -- python train.py --log-prefix 'run__{}_.log' --data-root-dir /data --num-epochs 5 --target-mae 0.124 --base-lr 0.004 --initial-lr 0.001 --momentum 0.9 --weight-decay 0.0 --warmup-epochs 0 --lr-scheduler-epochs 16 32 --lr-scheduler-decays 0.25 0.125 --training-batch-size 16 --validation-batch-size 16 --training-samples -1 --validation-samples -1 --data-layout NDHWC --data-shard-multiplier 1 --dali-num-threads 64 --shard-type local --seed 0 --grad-prediv-factor 1.0 --instances 1 --spatial-span 1 --load-checkpoint '' --save-checkpoint /results/checkpoint.data --apply-log-transform --shuffle --preshuffle --use-amp
+ exec numactl --physcpubind=0-15,64-79 --membind=0 -- python train.py --log-prefix 'run__{}_.log' --data-root-dir /data --num-epochs 5 --target-mae 0.124 --base-lr 0.004 --initial-lr 0.001 --momentum 0.9 --weight-decay 0.0 --warmup-epochs 0 --lr-scheduler-epochs 16 32 --lr-scheduler-decays 0.25 0.125 --training-batch-size 16 --validation-batch-size 16 --training-samples -1 --validation-samples -1 --data-layout NDHWC --data-shard-multiplier 1 --dali-num-threads 64 --shard-type local --seed 0 --grad-prediv-factor 1.0 --instances 1 --spatial-span 1 --load-checkpoint '' --save-checkpoint /results/checkpoint.data --apply-log-transform --shuffle --preshuffle --use-amp
+ exec numactl --physcpubind=16-31,80-95 --membind=0 -- python train.py --log-prefix 'run__{}_.log' --data-root-dir /data --num-epochs 5 --target-mae 0.124 --base-lr 0.004 --initial-lr 0.001 --momentum 0.9 --weight-decay 0.0 --warmup-epochs 0 --lr-scheduler-epochs 16 32 --lr-scheduler-decays 0.25 0.125 --training-batch-size 16 --validation-batch-size 16 --training-samples -1 --validation-samples -1 --data-layout NDHWC --data-shard-multiplier 1 --dali-num-threads 64 --shard-type local --seed 0 --grad-prediv-factor 1.0 --instances 1 --spatial-span 1 --load-checkpoint '' --save-checkpoint /results/checkpoint.data --apply-log-transform --shuffle --preshuffle --use-amp
[node001:100528] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[node001:100525] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[node001:100532] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[node001:100527] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[1666178350.584090] [node001:100532:0]    rc_mlx5_devx.c:99   UCX  ERROR mlx5dv_devx_create_event_channel() failed: Protocol not supported
[node001:100532] pml_ucx.c:309  Error: Failed to create UCP worker
[node001:100525] pml_ucx.c:178  Error: Failed to receive UCX worker address: Not found (-13)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[node001:100525] *** An error occurred in MPI_Init_thread
[node001:100525] *** reported by process [142357014,3]
[node001:100525] *** on a NULL communicator
[node001:100525] *** Unknown error
[node001:100525] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node001:100525] ***    and potentially your MPI job)
[node001:100528] pml_ucx.c:178  Error: Failed to receive UCX worker address: Not found (-13)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[node001:100528] *** An error occurred in MPI_Init_thread
[node001:100528] *** reported by process [142357014,2]
[node001:100528] *** on a NULL communicator
[node001:100528] *** Unknown error
[node001:100528] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node001:100528] ***    and potentially your MPI job)
[node001:100527] pml_ucx.c:178  Error: Failed to receive UCX worker address: Not found (-13)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[node001:100527] *** An error occurred in MPI_Init_thread
[node001:100527] *** reported by process [142357014,1]
[node001:100527] *** on a NULL communicator
[node001:100527] *** Unknown error
[node001:100527] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node001:100527] ***    and potentially your MPI job)
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 213.0 ON node001 CANCELLED AT 2022-10-19T20:19:10 ***
srun: error: node001: tasks 0-3: Killed

The UCX and PMIx errors can be ignored by adding the relevant flags to the srun or sbatch invocation.
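For example, something along these lines (these are common workarounds I came across rather than an official fix, so treat the exact values as assumptions that may need adjusting):

# exported in the shell before srun, or at the top of the sbatch script
export PMIX_MCA_gds=hash      # avoid the gds_ds12 shared-memory lock path behind the PMIX ERROR lines
export OMPI_MCA_pml=^ucx      # fall back from the UCX PML; alternatively keep UCX and restrict transports, e.g. UCX_TLS=tcp,self,sm

# then run the same srun/sbatch command as before

With sbatch these exports can simply go at the top of the batch script; with srun they should be inherited from the submitting shell and passed through into the tasks (and the pyxis container).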

Below is the output from simply running the ls command on node001; it runs fine:

[root@bright88 ~]# srun --kill-on-bad-exit=0 --mpi=pmix -G 4 --ntasks=4 -w node001 ls /mnt/lustre
cosmoUniverse_2019_05_4parE_tf_small
cosmoUniverse_2019_05_4parE_tf_v2
cosmoUniverse_2019_05_4parE_tf_v2.tar
mxnet
processed
results
cosmoUniverse_2019_05_4parE_tf_small
cosmoUniverse_2019_05_4parE_tf_v2
cosmoUniverse_2019_05_4parE_tf_v2.tar
mxnet
processed
results
cosmoUniverse_2019_05_4parE_tf_small
cosmoUniverse_2019_05_4parE_tf_v2
cosmoUniverse_2019_05_4parE_tf_v2.tar
mxnet
processed
results
cosmoUniverse_2019_05_4parE_tf_small
cosmoUniverse_2019_05_4parE_tf_v2
cosmoUniverse_2019_05_4parE_tf_v2.tar
mxnet
processed
results

Hi.

While we appreciate the questions, I must stress that this forum is NOT a replacement for professional support.
We will answer your questions when we have spare cycles to do so, but you should strongly consider purchasing a commercial license to gain access to the support team.

kw

Thanks Ken, I understand, but I have no choice as of now. I am trying to build an HPC cluster POC around Bright so that I can make the case to my management for buying licenses, etc. Until then I just have to figure it out myself or hope for help here :)

Ah! That’s good to know.

In that case, the correct path forward is to obtain an evaluation license from the sales team. An eval license will give you access to the support team.

Would you like to send me your contact information via email (kwoods@nvidia.com)? I can then coordinate contact with sales.

Thanks,
kw

I have an Easy8 license that's being used right now. I tried getting the evaluation license, but nobody contacted me. I will send you the email. Appreciate it.

Thanks Ken, I got the requested support. Is the Bright website down today? I am trying to log in, but the page keeps timing out.