Why does multi-node training take longer than single-node, even with more GPUs?

Problem Description
I am training an ML model in a multi-node, multi-GPU setup and comparing the results with single-node training.
I ran several experiments, including:

  1. Single node with 2 GPUs vs. two nodes (1 GPU on each node): the time per epoch is the same.
  2. Single node with 4 GPUs vs. two nodes (2 GPUs on each node): the time per epoch is similar in both cases, and in some runs it is even longer in the multi-node case.
  3. Single node with 4 GPUs vs. two nodes (4 GPUs on each node, 8 GPUs in total): here too the time per epoch (except the first epoch) is similar, even though the multi-node setup has twice as many GPUs as the single node. Why does this happen?

Environment

I am using TensorFlow 2.5
Python 3.8
For multi-GPU and multi-node training I am using TensorFlow's MirroredStrategy and MultiWorkerMirroredStrategy (a minimal sketch of the setup is below).
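For reference, a minimal sketch of how the two strategies are set up, assuming a two-worker cluster; the host:port pairs and the tiny model here are placeholders, not my actual configuration:

```python
import json
import os
import tensorflow as tf

# Single-node, multi-GPU: replicates the model on every visible GPU of this
# machine and all-reduces gradients locally.
single_node_strategy = tf.distribute.MirroredStrategy()

# Multi-node: each worker process sets TF_CONFIG before creating the strategy.
# The host:port pairs below are placeholders for the real node addresses.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["node1:12345", "node2:12345"]},
    "task": {"type": "worker", "index": 0},  # index 1 on the second node
})
multi_node_strategy = tf.distribute.MultiWorkerMirroredStrategy()

# Model and optimizer are created under the strategy scope as usual.
with multi_node_strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")
```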

Profiling:

  1. I compared GPU utilization in the multi-node and single-node runs: the average GPU utilization over the training period roughly halves in the multi-node case.
  2. I also profiled the process with NVIDIA Nsight Systems but could not draw any conclusion from it (a sketch of a complementary TensorFlow-side trace is below).
  3. Checked the interconnect bandwidth: it looks fine.
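For reference, a minimal sketch of capturing a TensorFlow-side trace to complement Nsight Systems; the log directory and step range are placeholder values:

```python
import tensorflow as tf

# Profile a short window of training steps and inspect the trace in
# TensorBoard's Profile tab; "./tb_logs" and the step range are examples.
tb_callback = tf.keras.callbacks.TensorBoard(
    log_dir="./tb_logs",
    profile_batch=(50, 60),  # profile training steps 50 through 60
)

# Usage (model and dataset defined elsewhere):
# model.fit(train_dataset, epochs=5, callbacks=[tb_callback])
```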

What could be the possible reason behind this?

  1. We need to find the bottleneck so that we can fix the setup and use the full power of the servers.
  2. Is it a communication-overhead issue, or something else? If so, how do we fix it (for example, by forcing NCCL collectives, as sketched below)?
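If cross-node communication does turn out to be the bottleneck, one option I understand from the TensorFlow distribute documentation (an assumption on my side, not something I have verified on this cluster) is to force NCCL for the cross-worker collectives:

```python
import tensorflow as tf

# Ask MultiWorkerMirroredStrategy to use NCCL for its all-reduce collectives
# instead of the auto-selected implementation (assumes NCCL is available).
options = tf.distribute.experimental.CommunicationOptions(
    implementation=tf.distribute.experimental.CommunicationImplementation.NCCL
)
strategy = tf.distribute.MultiWorkerMirroredStrategy(communication_options=options)
```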

Thanks
Nitish

Hi @tsmcnitish ,
This forum handles issues specific to TensorRT (TRT).
Raising this on the TensorFlow GitHub page may help.

Thanks