Why does multi-node training take longer than single-node, even with more GPUs?

Problem Description
I am training an ML model in a multi-node, multi-GPU setup and comparing the results with single-node training.
I ran several experiments, including:

  1. Single node with 2 GPUs vs. two nodes (1 GPU on each node): the time per epoch is the same.
  2. Single node with 4 GPUs vs. two nodes (2 GPUs on each node): the time per epoch is similar in both cases, and in some runs it is even longer in the multi-node case.
  3. Single node with 4 GPUs vs. two nodes (4 GPUs on each node, 8 GPUs in total): here too the time per epoch (except the first epoch) is similar, even though the multi-node setup has twice as many GPUs as the single node. Why does this happen?

Environment

I am using TensorFlow 2.5
Python 3.8
For multi-GPU and multi-node training I am using TensorFlow's MirroredStrategy and MultiWorkerMirroredStrategy (a minimal sketch of the setup is below).
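For reference, a minimal sketch of how the two strategies are set up, assuming a two-worker cluster; the host:port pairs and the tiny model here are placeholders, not my actual configuration:

```python
import json
import os
import tensorflow as tf

# Single-node, multi-GPU: replicates the model on every visible GPU of this
# machine and all-reduces gradients locally.
single_node_strategy = tf.distribute.MirroredStrategy()

# Multi-node: each worker process sets TF_CONFIG before creating the strategy.
# The host:port pairs below are placeholders for the real node addresses.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["node1:12345", "node2:12345"]},
    "task": {"type": "worker", "index": 0},  # index 1 on the second node
})
multi_node_strategy = tf.distribute.MultiWorkerMirroredStrategy()

# Model and optimizer are created under the strategy scope as usual.
with multi_node_strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")
```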

Profiling:

  1. I compared GPU utilization in the multi-node and single-node runs: the average GPU utilization over the training period roughly halves in the multi-node case.
  2. I also profiled the process with NVIDIA Nsight Systems but could not draw any conclusion from it (a sketch of a complementary TensorFlow-side trace is below).
  3. Checked the interconnect bandwidth: it looks fine.
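For reference, a minimal sketch of capturing a TensorFlow-side trace to complement Nsight Systems; the log directory and step range are placeholder values:

```python
import tensorflow as tf

# Profile a short window of training steps and inspect the trace in
# TensorBoard's Profile tab; "./tb_logs" and the step range are examples.
tb_callback = tf.keras.callbacks.TensorBoard(
    log_dir="./tb_logs",
    profile_batch=(50, 60),  # profile training steps 50 through 60
)

# Usage (model and dataset defined elsewhere):
# model.fit(train_dataset, epochs=5, callbacks=[tb_callback])
```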

What could be the possible reason behind this?

  1. We need to find the bottleneck so that we can fix the setup and use the full power of the servers.
  2. Is it a communication-overhead issue, or something else? If so, how do we fix it (for example, by forcing NCCL collectives, as sketched below)?
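If cross-node communication does turn out to be the bottleneck, one option I understand from the TensorFlow distribute documentation (an assumption on my side, not something I have verified on this cluster) is to force NCCL for the cross-worker collectives:

```python
import tensorflow as tf

# Ask MultiWorkerMirroredStrategy to use NCCL for its all-reduce collectives
# instead of the auto-selected implementation (assumes NCCL is available).
options = tf.distribute.experimental.CommunicationOptions(
    implementation=tf.distribute.experimental.CommunicationImplementation.NCCL
)
strategy = tf.distribute.MultiWorkerMirroredStrategy(communication_options=options)
```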

Thanks
Nitish

Hi @tsmcnitish ,
This forum handles issues specific to TensorRT (TRT).
Raising this on the TensorFlow GitHub page may help.

Thanks