Hi dear Modulus devs,
I am trying to troubleshoot a multi-node multiprocessing issue when using Slurm initialization. The GPU device ID is not set correctly: for example, on 2 DGX A100 nodes with 16 GPUs in total, the device ID ranges from 0 to 15, which produces a CUDA error.
To debug this, I am trying to print a variable from inside the container. Specifically, I am trying to export the local_rank variable from the initialize(), setup(), and initialize_slurm() functions of the DistributedManager class, but the print() function in that file does not produce any output. Could you please help with how to output the local_rank variable?
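As a side note while debugging, a minimal sketch like the following can dump the Slurm-assigned rank variables directly from each process (the variable names are standard Slurm ones; the helper function name is my own):

```python
import os

def slurm_rank_info():
    """Collect the SLURM-assigned rank variables for this process."""
    keys = ("SLURM_PROCID", "SLURM_LOCALID", "SLURM_NTASKS")
    return {k: os.environ.get(k, "<unset>") for k in keys}

# flush=True matters here: buffered stdout from processes running inside
# a container is often lost, which can make print() appear to do nothing.
for key, val in slurm_rank_info().items():
    print(f"{key}={val}", flush=True)
```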
You can get access to the parallel information that Modulus is using via the distributed manager. It's a singleton that gets initialized on its first construction.
For example, the following code can be used:
from modulus.distributed.manager import DistributedManager
# Initialize the singleton
DistributedManager.initialize()
# Get a manager object
manager = DistributedManager()
# Parallel attributes
print(manager.rank, manager.local_rank, manager.world_size, manager.device)
Modulus relies on environment variables set by either Slurm or MPI to set up the ranks between processes. The CUDA device ID is either based on the assigned local rank of the process or, if no local rank is provided, calculated as
rank % torch.cuda.device_count(). Additional info is in our user guide.
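As an illustration of the fallback described above (a sketch of the logic, not Modulus's actual implementation; the function name is my own), the device-ID resolution could look like:

```python
import os

def resolve_local_rank(rank: int, num_devices: int) -> int:
    """Pick a CUDA device index for this process.

    Prefer an explicitly assigned local rank (LOCAL_RANK or SLURM_LOCALID);
    otherwise fall back to rank % num_devices.
    """
    for var in ("LOCAL_RANK", "SLURM_LOCALID"):
        if var in os.environ:
            return int(os.environ[var])
    return rank % num_devices

# Example: with 2 nodes x 8 GPUs and no local-rank variable set,
# global rank 12 falls back to GPU 12 % 8 = 4.
```

This is why a missing SLURM_LOCALID is survivable when the fallback applies, but an incorrectly set one (e.g. ranging 0-15 across two 8-GPU nodes) produces an invalid device index.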
Not entirely sure why your print statement is not working.
Thank you @ngeneva for your kind response. I can print the parallel attributes outside the container, and it again shows the wrong numbering of the local rank.
It turns out that SLURM_LOCALID is not properly assigned by SLURM when resources are requested only as a total number of tasks. Changing
#SBATCH --ntasks=16
to
#SBATCH --ntasks-per-node=8
sets SLURM_LOCALID correctly.
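For reference, a minimal batch-script header reflecting this fix might look like the following (node/GPU counts match the 2x DGX A100 setup above; the script name and other values are illustrative placeholders):

```shell
#!/bin/bash
#SBATCH --nodes=2                # 2 DGX A100 nodes
#SBATCH --ntasks-per-node=8     # one task per GPU; SLURM sets SLURM_LOCALID to 0-7 on each node
#SBATCH --gpus-per-node=8

srun python train.py             # train.py is a placeholder for your training script
```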