Multi-node training with TAO on Slurm cluster

vovea · September 6, 2022, 3:28pm

Hi,
I am currently trying to set up training with TAO on a Slurm cluster. Only Singularity is available on the cluster.
I got training running on a single node with multiple GPUs using the following Slurm job script:

#!/bin/bash
#SBATCH --gres=gpu:2

singularity exec \
    --nv -B /data \
    tao-toolkit-tf_v3.22.05-tf1.15.5-py3.sif yolo_v4 train \
        --gpus 2 \
        -e <...>/spec.txt \
        -r <...>/results \
        -k nvidia_tao

This is working fine.

However, I am not sure how to trigger multi-node training. I studied the documentation page "Working With the Containers — TAO Toolkit 3.22.05 documentation", which explains how to use multi-node training with the TAO launcher, but not how to do it when invoking the container directly with Singularity.
My attempt was to launch the TAO container with mpirun (since the documentation states that TAO uses Open MPI + Horovod). The script looks like this:

#!/bin/bash
#SBATCH --gres=gpu:2
#SBATCH -N 2
#SBATCH --ntasks-per-node=1

# load btrp7nc openmpi@4.1.3%gcc@11.2.0
spack load /btrp7nc

mpirun -np $SLURM_NTASKS singularity exec \
    --nv -B /data \
    tao-toolkit-tf_v3.22.05-tf1.15.5-py3.sif yolo_v4 train \
        --gpus 2 \
        -e <...>/spec.txt \
        -r <...>/results \
        -k nvidia_tao \
        --multi-node

I can see in the Slurm output that two instances get started on the nodes, but they are canceled with:

**********************************************************

mpirun does not support recursive calls

**********************************************************
Using TensorFlow backend.

I hope you have some hints for me.
Thank you
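
The message text suggests that an mpirun is being started from inside a process that was itself launched by mpirun. Whether the train command starts its own mpirun when given --gpus 2 is an assumption here, not something confirmed in this thread. A minimal sketch for checking whether a command runs under Open MPI's launcher, which exports OMPI_COMM_WORLD_SIZE and OMPI_COMM_WORLD_RANK to the processes it starts (the function name launched_by_mpirun is hypothetical):

```shell
#!/bin/bash
# Hypothetical helper: succeeds if this process was started by Open MPI's mpirun,
# which exports OMPI_COMM_WORLD_SIZE to every process it launches.
launched_by_mpirun() {
    [ -n "${OMPI_COMM_WORLD_SIZE:-}" ]
}

if launched_by_mpirun; then
    echo "inside mpirun: rank ${OMPI_COMM_WORLD_RANK:-?} of ${OMPI_COMM_WORLD_SIZE}"
else
    echo "not launched by mpirun"
fi
```

Running this script in place of the train command, once with and once without the outer mpirun, would show whether those variables reach the container. Singularity normally passes the host environment through unless told otherwise.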

Morganh · September 7, 2022, 6:27am

Please refer to attached file.
There are examples for running yolov4, detectnet_v2 and Unet.
tao_multi_node_training_EA.pdf (704.3 KB)

vovea · September 13, 2022, 7:50am

Thanks for the PDF, @Morganh,

but I still have some questions:
I understand the concept described in the PDF. However, manually configuring every node seems very impractical when running multiple multi-node jobs on a cluster. Although I think this configuration could be done inside a Slurm job script, I expect it to be rather difficult.
Also, the PDF makes no mention of using Singularity instead of Docker, which in my experience is another possible source of problems.

"To invoke multi-node training, simply add the --multi-node argument to the train command."

How does this work in the TAO launcher? Does the launcher also perform the same configuration when invoking the TAO containers?

Thank you

Morganh · September 19, 2022, 8:58am

There has been no update from you for a while, so we assume this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.
Thanks

For Singularity, please refer to Frequently Asked Questions — TAO Toolkit 3.22.05 documentation

The launcher will not set up the configuration, so multi-node training does not work in the TAO launcher.

The options "-x NCCL_IB_HCA=mlx5_4,mlx5_6,mlx5_8,mlx5_10 -x NCCL_SOCKET_IFNAME=^lo,docker" work fine on NVIDIA NGC cloud machines.

The PDF covers the common case of setting up two nodes.
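
In a Slurm job, the list of nodes does not have to be written by hand; it could be derived from the allocation. The following is a minimal sketch, assuming the setup ultimately needs an Open MPI style host list of the form host:slots for mpirun -H. The function name build_host_list and the value of 2 GPUs per node are illustrative assumptions, not taken from the PDF.

```shell
#!/bin/bash
# Hypothetical helper: join hostnames into an Open MPI -H value with a fixed
# number of slots (GPUs) per node, e.g. "node1:2,node2:2".
# The body runs in a subshell so its variables do not leak into the caller.
build_host_list() (
    slots="$1"
    shift
    out=""
    for host in "$@"; do
        out="${out:+$out,}${host}:${slots}"
    done
    printf '%s\n' "$out"
)

# Inside a Slurm job, expand the compact node list (e.g. "node[01-02]")
# into one hostname per line. Skipped when not running under Slurm.
if [ -n "${SLURM_JOB_NODELIST:-}" ] && command -v scontrol >/dev/null 2>&1; then
    nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
    # Unquoted on purpose: split the hostname list into separate arguments.
    echo "Open MPI host list: $(build_host_list 2 $nodes)"
fi
```

The resulting string could then be passed to mpirun via -H, together with the -x options mentioned above if the cluster's network interfaces require them.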

system · October 24, 2022, 8:54am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
