Different speed test results for different machine configurations

I have three machines:
Machine 1: 4x A100 40GB PCIe GPUs, AMD EPYC 7662 64-Core Processor
Machine 2: 4x A100 80GB PCIe GPUs, AMD EPYC 7713 64-Core Processor
Machine 3: 4x A100 80GB SXM4 GPUs, AMD EPYC 7713 64-Core Processor

I first ran the MLPerf v1.1 training benchmarks from NVIDIA's submission to MLCommons. I used the same algorithm with the same hyperparameters on all three machines (even the same batch size on Machine 1, where each GPU has half the memory).
The training speed results are as follows (fastest to slowest):

Machine 3 (fastest) > Machine 2 > Machine 1 (slowest)

Now I am using a simple TensorFlow multi-GPU training example to train a semantic segmentation model on the Cityscapes dataset. All parameters are the same across the machines here as well (a simplified sketch of the setup follows the ranking below).
Surprisingly, here I see a different ordering of training speeds:

Machine 2 (fastest) > Machine 1 > Machine 3 (slowest)
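
For reference, here is a simplified sketch of the multi-GPU setup, following the tutorial's tf.distribute.MirroredStrategy. The tiny model, resolution, and random data are placeholders, not my actual Cityscapes pipeline:

```python
import tensorflow as tf

# Simplified multi-GPU setup following the "Distributed training with Keras"
# tutorial. The tiny model and random data below are placeholders standing in
# for my real Cityscapes segmentation model and input pipeline.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

PER_REPLICA_BATCH = 8  # placeholder value
GLOBAL_BATCH = PER_REPLICA_BATCH * strategy.num_replicas_in_sync

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu",
                               input_shape=(256, 512, 3)),
        tf.keras.layers.Conv2D(19, 1, activation="softmax"),  # 19 Cityscapes classes
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss=tf.keras.losses.SparseCategoricalCrossentropy())

# Dummy (image, label) pairs in place of the real Cityscapes tf.data pipeline.
images = tf.random.uniform((64, 256, 512, 3))
labels = tf.random.uniform((64, 256, 512), maxval=19, dtype=tf.int32)
train_ds = tf.data.Dataset.from_tensor_slices((images, labels)).batch(GLOBAL_BATCH)

model.fit(train_ds, epochs=1)
```

The same script runs unchanged on all three machines; only the hardware differs.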

But with single-GPU training, the behavior is different again (a single-GPU setup sketch follows the ranking below).

Machine 3 (fastest) > Machine 2 > Machine 1 (slowest)
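
For the single-GPU runs, a minimal sketch of one way to pin the job to a single device (my actual runs may simply drop the distribution strategy instead):

```python
import os

# Make only the first GPU visible to TensorFlow before it initializes
# (one possible way to force a single-GPU run).
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import tensorflow as tf

print("Visible GPUs:", tf.config.list_physical_devices("GPU"))  # expect one entry
# ...then build and fit the model without any tf.distribute strategy.
```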

When I check the TensorFlow profile of my training, I see that Machine 3 (SXM4) has higher kernel launch time and host compute time than the other two machines. Why are these times higher even though this machine has the best processor and GPUs?
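
For context, a simplified sketch of how the profile can be captured (the actual capture in my runs may differ slightly):

```python
import tensorflow as tf

# Capture a profile for a few steps, then open the trace in TensorBoard's
# Profile tab, where the kernel launch time and host compute time breakdown
# is shown on the Overview page.
logdir = "/tmp/tf_profile"

tf.profiler.experimental.start(logdir)
for step in range(10):
    # Stand-in for a real training step on the segmentation model.
    x = tf.random.uniform((8, 256, 512, 3))
    _ = tf.reduce_sum(tf.nn.relu(x))
tf.profiler.experimental.stop()

# Equivalently, with Keras one can pass
#   tf.keras.callbacks.TensorBoard(log_dir=logdir, profile_batch=(5, 10))
# to model.fit(...) to profile steps 5-10 of training.
```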

Why am I getting different results? What am I missing to fully leverage the SXM form-factor hardware?

TF version: 2.5.0 (TensorFlow Docker image)

Along with the example mentioned above, I have also tried training a model using the Tramac/awesome-semantic-segmentation-pytorch repo (https://github.com/Tramac/awesome-semantic-segmentation-pytorch).
I still see similar training speeds, with the same ordering as the TensorFlow model:

Machine 2 (fastest) > Machine 1 > Machine 3 (slowest)

Here is the TensorFlow tutorial I am following: Distributed training with Keras (https://www.tensorflow.org/tutorials/distribute/keras).