How to use multiple GPUs on a single machine to run the cases in Modulus

zhangzhenthu · May 10, 2023, 12:36pm

Is it correct to run “mpirun -np XX python xx.py”? Or do I need to set something in the Hydra config? Thanks!

zhangzhenthu · May 12, 2023, 6:13am

Hi, I tested the case three_fin_2d/heat_sink.py with one, two, and four GPUs using mpirun. The times for 500 steps are:
1 GPU: 1m30s
2 GPUs: 1m50s
4 GPUs: 2m5s
Why does training the same number of steps take longer with more GPUs?
Could anyone give me some ideas? Thanks!

tsltaywb · May 16, 2023, 2:53am

Hi, if I've guessed correctly, it’s because Modulus keeps the batch size on each GPU fixed when running on multiple GPUs.

Hence, the total batch size increases when more GPUs are used. If you want to keep the total batch size fixed, you need to edit the YAML config, e.g. halve the batch sizes if 2 GPUs are used.
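
To make the arithmetic concrete, here is a minimal sketch of the per-GPU batch size needed to hold the total fixed. The helper name per_gpu_batch_size and the batch size of 1000 are hypothetical, chosen only for illustration; they are not taken from the heat_sink.py config.

```python
def per_gpu_batch_size(total_batch_size: int, world_size: int) -> int:
    """Return the per-GPU batch size that keeps the total roughly constant."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    # Integer division keeps the result an int; never drop below 1 sample per GPU.
    return max(1, total_batch_size // world_size)


# Hypothetical single-GPU batch size of 1000 points.
for n_gpus in (1, 2, 4):
    print(n_gpus, "GPU(s):", per_gpu_batch_size(1000, n_gpus), "points per GPU")
```

If the total does not divide evenly, the integer division rounds down, so the effective total batch size ends up slightly smaller than on a single GPU.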

zhangzhenthu · May 17, 2023, 2:15am

I got it. Thank you!

tsltaywb · May 17, 2023, 2:24am

Btw, does anyone know how to automate this, i.e. divide the batch sizes automatically in the code when more than one GPU is used?

I tried to query the number of GPUs used using:

import torch

# Number of GPUs visible to this process (not the number of launched processes)
total_gpu = torch.cuda.device_count()
print('Current cuda device ', torch.cuda.current_device())
if torch.cuda.current_device() == 0:
    print('Available devices ', total_gpu)

It prints the correct no. of GPUs, but I can’t seem to use it as an integer to divide the batch sizes.

Does anyone know why?

ngeneva · May 24, 2023, 1:06am

Hi @tsltaywb

A good way to get the size of a DDP training session is the distributed manager in Modulus / Modulus-Sym.

from modulus.sym.distributed.manager import DistributedManager

# Initialize the singleton
DistributedManager.initialize()
# Get a manager object
manager = DistributedManager()

# Parallel attributes
manager.rank
manager.local_rank
manager.world_size
manager.device
cfg.batch_size = cfg.batch_size // manager.world_size  # Adjust the batch size; integer division keeps it an int
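
If the config holds several named batch sizes rather than a single number, the same idea can be applied to each entry. The sketch below is only an illustration under that assumption: it uses a plain dict as a stand-in for the config, the entry names are hypothetical, and world_size is passed in by hand where a real script would presumably use manager.world_size.

```python
from typing import Dict


def scale_batch_sizes(batch_sizes: Dict[str, int], world_size: int) -> Dict[str, int]:
    """Divide every batch size by the number of processes, keeping integers."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    # Floor division avoids float batch sizes; clamp so no entry becomes 0.
    return {name: max(1, size // world_size) for name, size in batch_sizes.items()}


# Hypothetical entries standing in for the batch sizes defined in the YAML config.
batch_sizes = {"inlet": 64, "outlet": 64, "interior": 1000}

# In a Modulus script this value would come from manager.world_size (assumption).
scaled = scale_batch_sizes(batch_sizes, world_size=4)
print(scaled)
```

Returning a new dict instead of modifying the config in place makes it easy to log both the original and the scaled values on rank 0.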

ngeneva · May 24, 2023, 1:07am

Some additional information here:

tsltaywb · June 4, 2023, 6:42am

Ok thanks! I’ll try it out.
