I have three machines:
Machine 1: 4× A100 40 GB PCIe, AMD EPYC 7662 64-core CPU
Machine 2: 4× A100 80 GB PCIe, AMD EPYC 7713 64-core CPU
Machine 3: 4× A100 80 GB SXM4, AMD EPYC 7713 64-core CPU
I first ran the MLPerf v1.1 training benchmark that NVIDIA submitted to MLCommons. I used the same algorithm with the same hyperparameters on all three machines (even the same batch size on Machine 1, although each of its GPUs has half the memory).
The training speeds ranked as follows (fastest to slowest):
Machine 3 > Machine 2 > Machine 1
Now I am using a simple TensorFlow multi-GPU training example to do semantic segmentation on the Cityscapes dataset. All parameters are the same here as well.
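For reference, the multi-GPU setup is the standard `tf.distribute.MirroredStrategy` pattern. A minimal sketch of what I mean (the tiny model and random data below are placeholder stand-ins, not my actual Cityscapes pipeline):

```python
import numpy as np
import tensorflow as tf

# MirroredStrategy replicates the model on every visible GPU and
# all-reduces gradients across replicas (NCCL by default on Linux).
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Tiny stand-in for a segmentation model (illustrative only).
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(8, 3, padding="same", activation="relu",
                               input_shape=(32, 32, 3)),
        # 1x1 conv producing per-pixel class probabilities (4 classes here)
        tf.keras.layers.Conv2D(4, 1, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy")

# Synthetic images and per-pixel labels standing in for Cityscapes.
x = np.random.rand(16, 32, 32, 3).astype("float32")
y = np.random.randint(0, 4, size=(16, 32, 32))
model.fit(x, y, batch_size=8, epochs=1, verbose=0)
```

The global batch size is split evenly across the replicas, which is why I keep the same batch size on every machine.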
Surprisingly, the training-speed ranking is different here:
Machine 2 > Machine 1 > Machine 3
But with single-GPU training, the ranking changes again:
Machine 3 > Machine 2 > Machine 1
When I checked the TensorFlow profile of my training, I saw that Machine 3 (SXM) has higher kernel launch time and host compute time than the other two machines. Why do we see these higher times even though this machine has the best CPU and GPUs?
Why am I getting different results? What am I missing here to fully leverage the SXM form-factor hardware?
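For context, the profile above was captured with the standard `tf.profiler` API; a minimal sketch of the capture (the log directory and the dummy matmul workload are illustrative, not my real training step):

```python
import tensorflow as tf

# Capture a trace around a few steps of work; the result is viewable
# in TensorBoard's Profile tab (kernel launch / host compute breakdown).
logdir = "/tmp/tf_profile"  # assumed output path

tf.profiler.experimental.start(logdir)
# ... a handful of real training steps would go here ...
a = tf.random.uniform((1024, 1024))
b = tf.matmul(a, a)  # dummy stand-in for actual training work
tf.profiler.experimental.stop()
```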
TF version: 2.5.0 (TensorFlow Docker image)
Along with the example mentioned above, I have also tried training a model using the GitHub repo Tramac/awesome-semantic-segmentation-pytorch (Semantic Segmentation on PyTorch, including FCN, PSPNet, Deeplabv3, Deeplabv3+, DANet, DenseASPP, BiSeNet, EncNet, DUNet, ICNet, ENet, OCNet, CCNet, PSANet, CGNet, ESPNet, LEDNet, DFANet).
I still see similar training speeds, with the same ranking as the TensorFlow model:
Machine 2 > Machine 1 > Machine 3