Hi,
I am currently trying to set up training with TAO on a Slurm cluster where only Singularity is available.
I got the training running on a single node with multiple GPUs using the following Slurm job script:
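The script itself was not attached, so here is a minimal sketch of what such a single-node job could look like (the image name, bind paths, spec file, and the detectnet_v2 task are placeholders, not the poster's actual values):

```shell
#!/bin/bash
#SBATCH --job-name=tao-train
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:2

# Placeholder image and paths -- adjust to your setup.
# --nv exposes the host GPUs inside the container.
singularity exec --nv \
  -B /data:/workspace/data \
  tao-toolkit-tf.sif \
  detectnet_v2 train \
    -e /workspace/data/specs/train.txt \
    -r /workspace/data/results \
    -k "$KEY" \
    --gpus 2
```

Single-node multi-GPU works here because the train command inside the container handles the per-GPU process launch itself.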
However, I am not quite sure how to trigger multi-node training. I studied the documentation at Working With the Containers — TAO Toolkit 3.22.05 documentation, which describes how to run multi-node training with the TAO launcher, but not how to do so when invoking the container with Singularity.
My attempt was to launch the TAO container with mpirun, since the documentation states that TAO uses Open MPI + Horovod. The script looks like:
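Again as a sketch, since the original script is not shown (node counts, image, and paths are placeholders): the attempt wraps the Singularity call in an outer mpirun, one rank per node. The train command inside the container launches its own mpirun for the multi-GPU processes, which is what produces the "recursive calls" error below.

```shell
#!/bin/bash
#SBATCH --job-name=tao-train-multi
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:2

# Outer mpirun: one container instance per allocated node.
# The inner train command then tries to start its own mpirun,
# which Open MPI rejects as a recursive call.
mpirun -np 2 -npernode 1 \
  singularity exec --nv \
    tao-toolkit-tf.sif \
    detectnet_v2 train \
      -e /workspace/data/specs/train.txt \
      -r /workspace/data/results \
      -k "$KEY" \
      --gpus 2
```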
I can see in the Slurm output that two instances get started on the nodes, but they get canceled with
**********************************************************
mpirun does not support recursive calls
**********************************************************
Using TensorFlow backend.
but I still have some questions:
I understand the concept described in the PDF; however, manually configuring every node seems very impractical when running multiple multi-node jobs on a cluster. Even though I think this configuration could be done within a Slurm job script, I expect it to be rather difficult.
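For what it's worth, part of that per-node configuration can be scripted inside the job itself. A sketch that derives an Open MPI hostfile from the nodes Slurm allocated (the environment variable and `scontrol` usage are standard Slurm; the slot count is a placeholder):

```shell
#!/bin/bash
# Build a hostfile from the allocation. SLURM_JOB_NODELIST is set
# by Slurm inside every job; "scontrol show hostnames" expands the
# compressed node list (e.g. node[01-02]) into one name per line.
HOSTFILE=$(mktemp)
for host in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
  echo "$host slots=2" >> "$HOSTFILE"   # slots=2: GPUs per node (placeholder)
done
cat "$HOSTFILE"
```

The resulting file can then be passed to mpirun via `--hostfile "$HOSTFILE"`, so the node list never has to be maintained by hand.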
Also, the PDF does not mention using Singularity instead of Docker, which in my experience is another possible source of problems.
To invoke multi-node training, simply add the --multi-node argument to the train command.
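As a concrete illustration of that reply (hypothetical: the task name, spec path, results directory, and key are placeholders; only the `--multi-node` flag comes from the reply itself):

```shell
# Launcher invocation with multi-node enabled (placeholder values).
tao detectnet_v2 train --multi-node \
  -e /workspace/specs/train.txt \
  -r /workspace/results \
  -k "$KEY"
```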
How does this work in the TAO launcher? Does the launcher perform the same configuration when invoking the TAO containers?
There has been no update from you for a while, so we assume this is no longer an issue.
Hence, we are closing this topic. If you need further support, please open a new one.
Thanks