Is it correct to run “mpirun -np XX python xx.py”? Or do I need to set something in the Hydra config? Thanks!
Hi, I tested the three_fin_2d/heat_sink.py example on one, two, and four GPUs using mpirun. The times for 500 steps were:
1 GPU: 1m30s
2 GPUs: 1m50s
4 GPUs: 2m5s
I am confused about why using more GPUs takes longer to train the same number of steps.
Could anyone give me some ideas? Thanks!
Hi, if I guessed correctly, it’s because of how Modulus handles multiple GPUs: the batch size on each GPU is fixed.
Hence, the total batch size increases as more GPUs are used, so each step processes more data overall and the time per step does not go down. If you want to keep the total batch size fixed, you need to edit the YAML config, e.g. halve the batch sizes when 2 GPUs are used.
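To make the arithmetic concrete, here is a minimal sketch in plain Python. The function names and the batch size of 1000 are made up for illustration and are not part of Modulus.

```python
def total_batch_size(per_gpu_batch: int, num_gpus: int) -> int:
    """Total samples per step when every GPU keeps the same batch size."""
    return per_gpu_batch * num_gpus


def per_gpu_batch_for_fixed_total(total_batch: int, num_gpus: int) -> int:
    """Per-GPU batch size needed to keep the total batch size unchanged."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if total_batch % num_gpus != 0:
        raise ValueError("total_batch is not evenly divisible by num_gpus")
    return total_batch // num_gpus


# With a fixed per-GPU batch of 1000 (hypothetical value), the total grows with GPU count.
for gpus in (1, 2, 4):
    print(gpus, "GPU(s): total batch =", total_batch_size(1000, gpus))

# To keep the total at 1000, each GPU must get a smaller batch.
for gpus in (1, 2, 4):
    print(gpus, "GPU(s): per-GPU batch =", per_gpu_batch_for_fixed_total(1000, gpus))
```

So a 500-step run on 4 GPUs with unchanged YAML sees four times as many samples as the 1-GPU run, which is why the step count alone is not a fair speed comparison.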
I got it. Thank you!
Btw, does anyone know how to automate this, i.e. divide the batch sizes in the code automatically when more than one GPU is used?
I tried to query the number of GPUs used using:
import torch
total_gpu = torch.cuda.device_count()
print('Current cuda device ', torch.cuda.current_device())
if torch.cuda.current_device() == 0:
    print('Available devices ', total_gpu)
It prints the correct number of GPUs, but I can’t seem to use it as an integer to divide the batch sizes.
Does anyone know why?
Hi @tsltaywb
A good way to get the size of a DDP training session is the distributed manager in Modulus / Modulus-Sym.
from modulus.sym.distributed.manager import DistributedManager
# Initialize the singleton
DistributedManager.initialize()
# Get a manager object
manager = DistributedManager()
# Parallel attributes
manager.rank
manager.local_rank
manager.world_size
manager.device
# Let's adjust our batch size (integer division keeps the result an int)
cfg.batch_size = cfg.batch_size // manager.world_size
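If the config holds several batch sizes, the same idea can be applied to each entry. Below is a rough sketch in plain Python that does not depend on Modulus. It assumes the world size is available from the WORLD_SIZE variable (set by torchrun) or OMPI_COMM_WORLD_SIZE (set by Open MPI's mpirun). The helper names and the keys "interior" and "boundary" are hypothetical. Note that torch.cuda.device_count() reports the GPUs visible to the current process, which is not necessarily the number of processes in the run.

```python
import os


def detect_world_size(env=None) -> int:
    """Read the number of processes from launcher environment variables.

    Falls back to 1 when neither variable is set (single-process run).
    """
    env = os.environ if env is None else env
    for name in ("WORLD_SIZE", "OMPI_COMM_WORLD_SIZE"):
        value = env.get(name)
        if value is not None:
            return int(value)
    return 1


def scale_batch_sizes(batch_sizes: dict, world_size: int) -> dict:
    """Divide every batch size by world_size, keeping each at least 1."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    return {name: max(1, size // world_size) for name, size in batch_sizes.items()}


# Hypothetical batch sizes, standing in for the values in the YAML config.
batch_sizes = {"interior": 4000, "boundary": 500}
world_size = detect_world_size()
print(scale_batch_sizes(batch_sizes, world_size))
```

In a Modulus script, manager.world_size from the DistributedManager could be passed in place of detect_world_size(), and the scaled values written back to the config before the constraints are built.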
Some additional information here:
Ok thanks! I’ll try it out.