Print output for troubleshooting multi-node run

Dear Modulus devs,

I am trying to troubleshoot a multi-node multiprocessing issue when using Slurm initialization. The GPU device ID is not set correctly: for example, on 2 DGX A100 nodes with 16 GPUs in total, the device ID ranges from 0 to 15, which produces a CUDA error.

To troubleshoot this, I am trying to print a variable from inside the container. Specifically, I am trying to print the local_rank variable in the initialize(), setup(), and initialize_slurm() functions of the DistributedManager class in $CONTAINER_HOME/modulus/modulus/distributed/manager.py.

But the print() calls in that file do not produce any output. Could you please help me output the local_rank variable?

Thanks!

Hi @yunchaoyang

You can access the parallel information that Modulus is using through the distributed manager. It's a singleton that gets initialized on first construction.

For example the following code can be used:

from modulus.distributed.manager import DistributedManager

# Initialize the singleton
DistributedManager.initialize()
# Get a manager object
manager = DistributedManager()

# Parallel attributes
manager.rank
manager.local_rank
manager.world_size
manager.device

Modulus relies on environment variables set by either Slurm or MPI to set up the ranks between processes. The device/cuda ID is either based on the assigned local rank of the process or, if no local rank is provided, calculated by rank % torch.cuda.device_count(). Additional info is in our user guide.
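The selection rule described above can be sketched in a few lines. This is only an illustration of the stated logic, not the code in manager.py: the helper name pick_device_index is hypothetical, and reading the rank from SLURM_PROCID and the local rank from SLURM_LOCALID is an assumption about the Slurm case.

```python
from typing import Mapping


def pick_device_index(env: Mapping[str, str], device_count: int) -> int:
    """Illustrative helper (hypothetical, not part of Modulus).

    Uses the local rank when the launcher provides one, otherwise
    falls back to rank % device_count.
    """
    if device_count <= 0:
        raise ValueError("device_count must be positive")

    # Assumption: the global rank comes from SLURM_PROCID.
    rank = int(env.get("SLURM_PROCID", "0"))

    # Assumption: the local rank comes from SLURM_LOCALID, if set.
    local_rank = env.get("SLURM_LOCALID")
    if local_rank is not None:
        # A local rank that is really a global rank (e.g. 12 on an
        # 8-GPU node) is passed through unchanged and is invalid.
        return int(local_rank)

    return rank % device_count


# Hypothetical task 12 on a node with 8 visible GPUs.
fallback = pick_device_index({"SLURM_PROCID": "12"}, 8)
mis_set = pick_device_index({"SLURM_PROCID": "12", "SLURM_LOCALID": "12"}, 8)
```

Under these assumptions, a local rank that runs from 0 to 15 across two 8-GPU nodes is used as-is, which would match the invalid device IDs reported in the question, while the modulo fallback stays within 0 to 7.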

Not entirely sure why your print statement is not working.


Thank you @ngeneva for your kind response. I can print the parallel attributes outside the container. It again shows the wrong numbering of the local rank.

This is because SLURM_LOCALID is not assigned properly by Slurm when resources are requested by total number of tasks.

Changing #SBATCH --ntasks=16 to #SBATCH --ntasks-per-node=8 sets SLURM_LOCALID correctly.
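For anyone checking the same thing, a small script run once per task can show what Slurm actually assigned. This is a minimal sketch: the function name slurm_rank_report is hypothetical, and the variables read are standard Slurm task environment variables. It writes to stderr with flush=True so the line is not held in a buffer; whether buffering was the reason the earlier print() calls showed nothing is not established in this thread.

```python
import os
import socket
import sys
from typing import Mapping


def slurm_rank_report(env: Mapping[str, str] = os.environ) -> str:
    """Format the Slurm rank variables for this task (hypothetical helper)."""
    keys = ("SLURM_PROCID", "SLURM_LOCALID", "SLURM_NODEID", "SLURM_NTASKS")
    fields = " ".join(f"{key}={env.get(key, '<unset>')}" for key in keys)
    # Prefix with the hostname so lines from different nodes can be told apart.
    return f"[{socket.gethostname()}] {fields}"


# Flush immediately so the line reaches the job log even if the
# process later crashes with a CUDA error.
print(slurm_rank_report(), file=sys.stderr, flush=True)
```

On a correctly configured two-node job with 8 tasks per node, SLURM_LOCALID should stay within 0 to 7 on each host while SLURM_PROCID runs from 0 to 15.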
