Slurm configuration issue with PMIx

Hi Folks

Below is the bash script I am trying to run inside an enroot container (via the pyxis plugin) with srun. The first python command executes, but the second one fails. Please see the script below and the trace that follows.

#!/bin/bash

DATA_SRC_DIR=$1
DATA_DST_DIR=$2

python3 -m tools.convert_tfrecord_to_numpy -i ${DATA_SRC_DIR}/train -o ${DATA_DST_DIR}/train -c gzip
ls -1 ${DATA_DST_DIR}/train | grep "_data.npy" | sort > ${DATA_DST_DIR}/train/files_data.lst
ls -1 ${DATA_DST_DIR}/train | grep "_label.npy" | sort > ${DATA_DST_DIR}/train/files_label.lst

python3 -m tools.convert_tfrecord_to_numpy -i ${DATA_SRC_DIR}/validation -o ${DATA_DST_DIR}/validation -c gzip
ls -1 ${DATA_DST_DIR}/validation | grep "_data.npy" | sort > ${DATA_DST_DIR}/validation/files_data.lst
ls -1 ${DATA_DST_DIR}/validation | grep "_label.npy" | sort > ${DATA_DST_DIR}/validation/files_label.lst

[root@bright88 ~]# SLURM_DEBUG=2 srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash" --mpi=pmix_v3 -N 1 -G 1 --ntasks=1 -w node001 --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=cosmoflow-preprocess --container-workdir=/mnt/mxnet --container-mounts=/mnt/lustre:/mnt bash /mnt/mxnet/tools/init_datasets.sh /mnt/cosmoUniverse_2019_05_4parE_tf_small /mnt/processed


srun: select/cons_res: common_init: select/cons_res loaded
srun: select/cons_tres: common_init: select/cons_tres loaded
srun: select/linear: init: Linear node selection plugin loaded with argument 4
srun: debug:  switch/none: init: switch NONE plugin loaded
srun: debug:  spank: opening plugin stack /cm/shared/apps/slurm/var/etc/slurm/plugstack.conf
srun: debug:  /cm/shared/apps/slurm/var/etc/slurm/plugstack.conf: 1: include "/cm/shared/apps/slurm/var/etc/slurm/plugstack.conf.d/*"
srun: debug:  spank: opening plugin stack /cm/shared/apps/slurm/var/etc/slurm/plugstack.conf.d/pyxis.conf
srun: debug:  spank: /cm/shared/apps/slurm/var/etc/slurm/plugstack.conf.d/pyxis.conf:1: Loaded plugin spank_pyxis.so
srun: debug:  SPANK: appending plugin option "container-image"
srun: debug:  SPANK: appending plugin option "container-mounts"
srun: debug:  SPANK: appending plugin option "container-workdir"
srun: debug:  SPANK: appending plugin option "container-name"
srun: debug:  SPANK: appending plugin option "container-save"
srun: debug:  SPANK: appending plugin option "container-mount-home"
srun: debug:  SPANK: appending plugin option "no-container-mount-home"
srun: debug:  SPANK: appending plugin option "container-remap-root"
srun: debug:  SPANK: appending plugin option "no-container-remap-root"
srun: debug:  SPANK: appending plugin option "container-entrypoint"
srun: debug:  SPANK: appending plugin option "no-container-entrypoint"
srun: launch/slurm: init: launch Slurm plugin loaded
srun: debug:  mpi type = pmix_v3
srun: debug:  mpi/pmix_v3: init: PMIx plugin loaded
srun: debug:  propagating RLIMIT_CPU=18446744073709551615
srun: debug:  propagating RLIMIT_FSIZE=18446744073709551615
srun: debug:  propagating RLIMIT_DATA=18446744073709551615
srun: debug:  propagating RLIMIT_STACK=18446744073709551615
srun: debug:  propagating RLIMIT_CORE=0
srun: debug:  propagating RLIMIT_RSS=18446744073709551615
srun: debug:  propagating RLIMIT_NPROC=254374
srun: debug:  propagating RLIMIT_NOFILE=131072
srun: debug:  propagating RLIMIT_MEMLOCK=18446744073709551615
srun: debug:  propagating RLIMIT_AS=18446744073709551615
srun: debug:  propagating SLURM_PRIO_PROCESS=0
srun: debug:  propagating UMASK=0022
srun: debug:  Entering slurm_allocation_msg_thr_create()
srun: debug:  port from net_stream_listen is 43950
srun: debug:  Entering _msg_thr_internal
srun: debug:  auth/munge: init: Munge authentication plugin loaded
srun: Waiting for nodes to boot (delay looping 450 times @ 0.100000 secs x index)
srun: Nodes node001 are ready for job
srun: jobid 131: nodes(1):`node001', cpu counts: 2(x1)
srun: debug:  requesting job 131, user 0, nodes 1 including (node001)
srun: debug:  cpus 1, tasks 1, name bash, relative 65534
srun: launch/slurm: launch_p_step_launch: CpuBindType=(null type)
srun: debug:  Entering slurm_step_launch
srun: debug:  mpi type = (null)
srun: debug:  mpi/pmix_v3: pmixp_abort_agent_start: (null) [0]: pmixp_agent.c:376: Abort agent port: 37785
srun: debug:  mpi/pmix_v3: p_mpi_hook_client_prelaunch: (null) [0]: mpi_pmix.c:224: setup process mapping in srun
srun: debug:  mpi/pmix_v3: _pmix_abort_thread: (null) [0]: pmixp_agent.c:352: Start abort thread
srun: debug:  Entering _msg_thr_create()
srun: debug:  initialized stdio listening socket, port 37471
srun: debug:  Started IO server thread (46912584451840)
srun: debug:  Entering _launch_tasks
srun: launching StepId=131.0 on host node001, 1 tasks: 0
srun: route/default: init: route default plugin loaded
srun: debug:  launch returned msg_rc=0 err=0 type=8001
srun: launch/slurm: _task_start: Node node001, 1 tasks started
[1666063323.785314] [node001:93606:0]    rc_mlx5_devx.c:99   UCX  ERROR mlx5dv_devx_create_event_channel() failed: Protocol not supported
[node001:93606] pml_ucx.c:309  Error: Failed to create UCP worker
2022-10-18 12:22:03.830467: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-10-18 12:22:04.429732: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79135 MB memory:  -> device: 0, name: NVIDIA A100 80GB PCIe, pci bus id: 0000:17:00.0, compute capability: 8.0
2022-10-18 12:22:05.024639: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-10-18 12:22:05.109382: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-10-18 12:22:05.183814: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-10-18 12:22:05.255529: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-10-18 12:22:05.327216: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-10-18 12:22:05.398774: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-10-18 12:22:05.472383: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-10-18 12:22:05.544805: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-10-18 12:22:05.618955: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-10-18 12:22:05.692236: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-10-18 12:22:05.763324: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-10-18 12:22:05.834954: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-10-18 12:22:05.906643: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-10-18 12:22:05.976741: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-10-18 12:22:06.047337: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-10-18 12:22:06.117514: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-10-18 12:22:06.189197: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-10-18 12:22:06.260510: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-10-18 12:22:06.337110: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-10-18 12:22:06.408978: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-10-18 12:22:06.479989: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-10-18 12:22:06.550954: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-10-18 12:22:06.621199: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-10-18 12:22:06.692573: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-10-18 12:22:06.762497: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-10-18 12:22:06.833505: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-10-18 12:22:06.904602: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-10-18 12:22:06.974962: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-10-18 12:22:07.046702: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-10-18 12:22:07.116413: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-10-18 12:22:07.186980: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-10-18 12:22:07.258767: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
Found 32 files, 0 are done, 32 are remaining.
[node001:94018] PMIX ERROR: NOT-FOUND in file ptl_usock.c at line 175
[node001:94018] OPAL ERROR: Unreachable in file pmix3x_client.c at line 112
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node001:94018] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
/mnt/mxnet/tools/init_datasets.sh: line 15: /mnt/processed/validation/files_data.lst: No such file or directory
ls: cannot access '/mnt/processed/validation': No such file or directory
/mnt/mxnet/tools/init_datasets.sh: line 16: /mnt/processed/validation/files_label.lst: No such file or directory
ls: cannot access '/mnt/processed/validation': No such file or directory
srun: launch/slurm: _task_finish: Received task exit notification for 1 task of StepId=131.0 (status=0x0100).
srun: error: node001: task 0: Exited with exit code 1
srun: debug:  task 0 done
srun: debug:  IO thread exiting
srun: debug:  mpi/pmix_v3: _conn_readable: (null) [0]: pmixp_agent.c:103:     false, shutdown
srun: debug:  mpi/pmix_v3: _pmix_abort_thread: (null) [0]: pmixp_agent.c:354: Abort thread exit
srun: debug:  Leaving _msg_thr_internal
[root@bright88 ~]# srun --mpi=list
srun: MPI types are...
srun: none
srun: pmi2
srun: pmix
srun: pmix_v3

Is it possible to simplify the jobscript in such a way that we will be able to reproduce the issue without your src data?
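If it helps, a stripped-down script along these lines might hit the same problem without our dataset. This is only a guess on my side, and it assumes mpi4py is available inside the container image (the real conversion tool may initialize MPI through a different binding):

#!/bin/bash
# Hypothetical minimal reproducer (untested): two consecutive Python
# processes in the same srun task, each initializing MPI on import.
python3 -c "from mpi4py import MPI; print('first init, rank', MPI.COMM_WORLD.Get_rank())"
python3 -c "from mpi4py import MPI; print('second init, rank', MPI.COMM_WORLD.Get_rank())"

It would be launched with the same srun/pyxis options as above, just replacing the final bash /mnt/mxnet/tools/init_datasets.sh arguments with this script.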

I guess my issue is similar to this one: running any two executables (Python or otherwise) from a single bash script.
Can we achieve this by configuring two different ranks for the two different processes?
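What I have in mind is roughly the sketch below (untested, just to illustrate the question): launch the step with --ntasks=2 and let a small wrapper pick the dataset split from SLURM_PROCID, so each conversion runs as its own rank instead of two MPI programs sharing one rank.

#!/bin/bash
# Hypothetical wrapper (untested): one task per dataset split,
# selected from the Slurm-provided rank of this process.
DATA_SRC_DIR=$1
DATA_DST_DIR=$2

case "${SLURM_PROCID}" in
  0) SPLIT=train ;;
  1) SPLIT=validation ;;
  *) exit 0 ;;
esac

python3 -m tools.convert_tfrecord_to_numpy -i ${DATA_SRC_DIR}/${SPLIT} -o ${DATA_DST_DIR}/${SPLIT} -c gzip
ls -1 ${DATA_DST_DIR}/${SPLIT} | grep "_data.npy" | sort > ${DATA_DST_DIR}/${SPLIT}/files_data.lst
ls -1 ${DATA_DST_DIR}/${SPLIT} | grep "_label.npy" | sort > ${DATA_DST_DIR}/${SPLIT}/files_label.lst

Or is the recommended approach to keep the script sequential and run each python command as its own srun step instead?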