Hi folks, I'm running the Slurm workload manager under Bright Cluster Manager 9.2 and recently edited slurm.conf to add GRES configuration. MPI_Init errors now show up when running srun or sbatch with multiple processes; a single-process run passes.
FYI, I'm using enroot + pyxis to run containers with Slurm.
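For context, the GRES changes were roughly along the lines below. This is only a sketch: the GresTypes value, node name, GPU count, and device paths are illustrative assumptions, not my exact entries.

# slurm.conf (illustrative)
GresTypes=gpu
NodeName=node001 Gres=gpu:4

# gres.conf on the compute node (illustrative device paths)
Name=gpu File=/dev/nvidia[0-3]

The full trace from the failing multi-process run is below.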
[root@bright88 mxnet]# srun --kill-on-bad-exit=0 --mpi=pmix -G 4 --ntasks=4 -w node001 --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area bash ./run_and_time.sh
STARTING TIMING RUN AT 2022-10-19 08:19:08 PM
running benchmark
STARTING TIMING RUN AT 2022-10-19 08:19:08 PM
STARTING TIMING RUN AT 2022-10-19 08:19:08 PM
running benchmark
running benchmark
STARTING TIMING RUN AT 2022-10-19 08:19:08 PM
running benchmark
num_sockets = 2 num_nodes=2 cores_per_socket=32
+ exec numactl --physcpubind=48-63,112-127 --membind=1 -- python train.py --log-prefix 'run__{}_.log' --data-root-dir /data --num-epochs 5 --target-mae 0.124 --base-lr 0.004 --initial-lr 0.001 --momentum 0.9 --weight-decay 0.0 --warmup-epochs 0 --lr-scheduler-epochs 16 32 --lr-scheduler-decays 0.25 0.125 --training-batch-size 16 --validation-batch-size 16 --training-samples -1 --validation-samples -1 --data-layout NDHWC --data-shard-multiplier 1 --dali-num-threads 64 --shard-type local --seed 0 --grad-prediv-factor 1.0 --instances 1 --spatial-span 1 --load-checkpoint '' --save-checkpoint /results/checkpoint.data --apply-log-transform --shuffle --preshuffle --use-amp
num_sockets = 2 num_nodes=2 cores_per_socket=32
num_sockets = 2 num_nodes=2 cores_per_socket=32
num_sockets = 2 num_nodes=2 cores_per_socket=32
+ exec numactl --physcpubind=32-47,96-111 --membind=1 -- python train.py --log-prefix 'run__{}_.log' --data-root-dir /data --num-epochs 5 --target-mae 0.124 --base-lr 0.004 --initial-lr 0.001 --momentum 0.9 --weight-decay 0.0 --warmup-epochs 0 --lr-scheduler-epochs 16 32 --lr-scheduler-decays 0.25 0.125 --training-batch-size 16 --validation-batch-size 16 --training-samples -1 --validation-samples -1 --data-layout NDHWC --data-shard-multiplier 1 --dali-num-threads 64 --shard-type local --seed 0 --grad-prediv-factor 1.0 --instances 1 --spatial-span 1 --load-checkpoint '' --save-checkpoint /results/checkpoint.data --apply-log-transform --shuffle --preshuffle --use-amp
+ exec numactl --physcpubind=0-15,64-79 --membind=0 -- python train.py --log-prefix 'run__{}_.log' --data-root-dir /data --num-epochs 5 --target-mae 0.124 --base-lr 0.004 --initial-lr 0.001 --momentum 0.9 --weight-decay 0.0 --warmup-epochs 0 --lr-scheduler-epochs 16 32 --lr-scheduler-decays 0.25 0.125 --training-batch-size 16 --validation-batch-size 16 --training-samples -1 --validation-samples -1 --data-layout NDHWC --data-shard-multiplier 1 --dali-num-threads 64 --shard-type local --seed 0 --grad-prediv-factor 1.0 --instances 1 --spatial-span 1 --load-checkpoint '' --save-checkpoint /results/checkpoint.data --apply-log-transform --shuffle --preshuffle --use-amp
+ exec numactl --physcpubind=16-31,80-95 --membind=0 -- python train.py --log-prefix 'run__{}_.log' --data-root-dir /data --num-epochs 5 --target-mae 0.124 --base-lr 0.004 --initial-lr 0.001 --momentum 0.9 --weight-decay 0.0 --warmup-epochs 0 --lr-scheduler-epochs 16 32 --lr-scheduler-decays 0.25 0.125 --training-batch-size 16 --validation-batch-size 16 --training-samples -1 --validation-samples -1 --data-layout NDHWC --data-shard-multiplier 1 --dali-num-threads 64 --shard-type local --seed 0 --grad-prediv-factor 1.0 --instances 1 --spatial-span 1 --load-checkpoint '' --save-checkpoint /results/checkpoint.data --apply-log-transform --shuffle --preshuffle --use-amp
[node001:100528] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[node001:100525] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[node001:100532] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[node001:100527] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[1666178350.584090] [node001:100532:0] rc_mlx5_devx.c:99 UCX ERROR mlx5dv_devx_create_event_channel() failed: Protocol not supported
[node001:100532] pml_ucx.c:309 Error: Failed to create UCP worker
[node001:100525] pml_ucx.c:178 Error: Failed to receive UCX worker address: Not found (-13)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
PML add procs failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[node001:100525] *** An error occurred in MPI_Init_thread
[node001:100525] *** reported by process [142357014,3]
[node001:100525] *** on a NULL communicator
[node001:100525] *** Unknown error
[node001:100525] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node001:100525] *** and potentially your MPI job)
[node001:100528] pml_ucx.c:178 Error: Failed to receive UCX worker address: Not found (-13)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
PML add procs failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[node001:100528] *** An error occurred in MPI_Init_thread
[node001:100528] *** reported by process [142357014,2]
[node001:100528] *** on a NULL communicator
[node001:100528] *** Unknown error
[node001:100528] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node001:100528] *** and potentially your MPI job)
[node001:100527] pml_ucx.c:178 Error: Failed to receive UCX worker address: Not found (-13)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
PML add procs failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[node001:100527] *** An error occurred in MPI_Init_thread
[node001:100527] *** reported by process [142357014,1]
[node001:100527] *** on a NULL communicator
[node001:100527] *** Unknown error
[node001:100527] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node001:100527] *** and potentially your MPI job)
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 213.0 ON node001 CANCELLED AT 2022-10-19T20:19:10 ***
srun: error: node001: tasks 0-3: Killed