Multi-node training with TAO on a Slurm cluster

Hi,
I am currently trying to set up training with TAO on a Slurm cluster. Only Singularity is available on the cluster.
I got training running on a single node with multiple GPUs using the following Slurm job script:

#!/bin/bash
#SBATCH --gres=gpu:2

singularity exec \
    --nv -B /data \
    tao-toolkit-tf_v3.22.05-tf1.15.5-py3.sif yolo_v4 train \
        --gpus 2 \
        -e <...>/spec.txt \
        -r <...>/results \
        -k nvidia_tao

This is working fine.

However, I am not sure how to trigger multi-node training. I studied the documentation at "Working With the Containers — TAO Toolkit 3.22.05 documentation", which describes how to use multi-node training with the TAO launcher, but not how to do it when invoking the container with Singularity.
My attempt was to launch the TAO container with mpirun (since the documentation states that TAO uses Open MPI + Horovod). The script looks like this:

#!/bin/bash
#SBATCH --gres=gpu:2
#SBATCH -N 2
#SBATCH --ntasks-per-node=1

# load openmpi@4.1.3%gcc@11.2.0 (spack hash btrp7nc)
spack load /btrp7nc

mpirun -np $SLURM_NTASKS singularity exec \
    --nv -B /data \
    tao-toolkit-tf_v3.22.05-tf1.15.5-py3.sif yolo_v4 train \
        --gpus 2 \
        -e <...>/spec.txt \
        -r <...>/results \
        -k nvidia_tao \
        --multi-node

I can see in the Slurm output that two instances are started on the nodes, but they are aborted with:

**********************************************************

mpirun does not support recursive calls

**********************************************************
Using TensorFlow backend.

I hope you have some hints for me.
Thank you
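
A note on the error above: as far as I know, Open MPI prints "mpirun does not support recursive calls" when an mpirun is started from a process that was itself launched by mpirun. If the train entrypoint starts its own mpirun when --gpus 2 is given (an assumption based on the documentation's mention of Open MPI + Horovod, not something confirmed in this thread), the outer mpirun in the job script would explain the message. Below is a minimal sketch for checking whether a process sees variables set by an outer launcher. The function name list_mpi_env is made up for illustration.

```shell
#!/bin/bash
# Print any Open MPI / PMIx variables present in the current environment.
# Hypothetical helper: if this prints anything inside the container,
# the process was most likely started by an outer mpirun.
list_mpi_env() {
    # "|| true" keeps the exit status at 0 when nothing matches
    env | grep -E '^(OMPI_|PMIX_)' || true
}

list_mpi_env
```

Running this script in place of the train command in the mpirun line, and again without mpirun, should show whether the outer launcher's environment reaches the container.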

Please refer to the attached file.
It contains examples for running yolov4, detectnet_v2 and Unet.
tao_multi_node_training_EA.pdf (704.3 KB)

Thanks for the PDF @Morganh,

but I still have some questions:
I understand the concept described in the PDF. However, manually configuring every node seems very impractical when running multiple multi-node jobs on a cluster. I think this configuration could be done within a Slurm job script, but I expect it to be rather difficult.
The PDF also does not mention using Singularity instead of Docker, which in my experience is another possible source of problems.

To invoke multi-node training, simply add the --multi-node argument to the train command.

How does this work in the TAO launcher? Does the launcher perform the same configuration when invoking the TAO containers?

Thank you

For Singularity, please refer to "Frequently Asked Questions — TAO Toolkit 3.22.05 documentation".

The launcher does not set up this configuration, so multi-node training does not work with the TAO launcher.

The options "-x NCCL_IB_HCA=mlx5_4,mlx5_6,mlx5_8,mlx5_10 -x NCCL_SOCKET_IFNAME=^lo,docker" work fine on NVIDIA NGC cloud machines.

The PDF describes the common case of setting up two nodes.
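
For the Slurm side of the question, the per-node host list does not have to be written by hand. The following is a minimal sketch, assuming Open MPI's "-H host:slots" syntax and two GPUs per node as in the job script above. The function name build_host_list is made up for illustration, and the interface names to pass via -x NCCL_IB_HCA and NCCL_SOCKET_IFNAME depend on the cluster hardware, so they are not filled in here.

```shell
#!/bin/bash
# Build an Open MPI host list ("node1:2,node2:2") from hostnames.
# Arguments: slots per node, followed by one hostname per argument.
# build_host_list is a hypothetical helper, not part of TAO or Slurm.
build_host_list() {
    local slots="$1"; shift
    local list="" host
    for host in "$@"; do
        # Prepend a comma only when the list is already non-empty
        list="${list:+$list,}${host}:${slots}"
    done
    printf '%s\n' "$list"
}

# Inside a Slurm job, scontrol expands the compressed node list
# (e.g. "node[1-2]") into one hostname per line.
if [ -n "${SLURM_JOB_NODELIST:-}" ]; then
    mapfile -t nodes < <(scontrol show hostnames "$SLURM_JOB_NODELIST")
    echo "Host list for mpirun -H: $(build_host_list 2 "${nodes[@]}")"
fi
```

Outside a Slurm job the last block is skipped, so the function can be tried on its own, for example with build_host_list 2 node1 node2.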