Model sharding, data parallelism and NVLink

Hi Guys,
I read the following threads:
https://forums.developer.nvidia.com/t/can-nvlink-combine-2x-gpus-into-1x-big-gpu/73594

https://discuss.pytorch.org/t/split-single-model-in-multiple-gpus/13239

and also watched the following video:
https://youtu.be/_d3xs1L4jeA

Which led me to the following question:
In my group, we are interested in buying a server with 8 Nvidia A40 GPUs, split into 4 groups of 2, where each pair of GPUs is physically connected by an NVLink bridge.
I wonder how using 4 NVLink-connected pairs will affect the performance of data parallelism and model sharding. How would it differ from using the same 8 GPUs without any NVLink bridges between them?

Thanks

Hi,

This doesn’t look cuDNN related. We recommend posting your question on the relevant platform to get better help.

Thank you.

cuDNN currently does not support partitioning computations across multiple GPUs.
Some of the DL frameworks support those features, and you may be able to modify your model training scripts to use them. If you can use those features, the faster NVLink connection will speed up communication between the two GPUs in each bridged pair (compared to the slower PCIe).
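
To make the difference concrete, here is a rough sketch in plain Python of one layout a training script might use on the server described in the question. It is only an illustration under stated assumptions: the pairing (0,1), (2,3), (4,5), (6,7) is a guess (check the real topology with `nvidia-smi topo -m`), and `link_type` and `build_groups` are hypothetical helper names, not part of any framework.

```python
from itertools import combinations

NUM_GPUS = 8  # from the question: 8 A40 GPUs

# Assumption: the bridges join GPUs (0,1), (2,3), (4,5), (6,7).
NVLINK_PAIRS = [(i, i + 1) for i in range(0, NUM_GPUS, 2)]


def link_type(a, b):
    """Return 'nvlink' if GPUs a and b share a bridge, else 'pcie'."""
    return "nvlink" if (min(a, b), max(a, b)) in NVLINK_PAIRS else "pcie"


def build_groups():
    """Hypothetical layout: shard each model replica across one NVLink
    pair, and run data parallelism across the four pairs."""
    model_parallel = [list(pair) for pair in NVLINK_PAIRS]
    # GPUs holding the same shard in every replica exchange gradients.
    data_parallel = [[pair[k] for pair in NVLINK_PAIRS] for k in range(2)]
    return model_parallel, data_parallel


model_parallel, data_parallel = build_groups()
print("model-parallel groups:", model_parallel)
print("data-parallel groups: ", data_parallel)

# Count which GPU-to-GPU links inside each kind of group use NVLink.
for name, groups in (("model", model_parallel), ("data", data_parallel)):
    links = [link_type(a, b) for g in groups for a, b in combinations(g, 2)]
    print(name, "parallel:", links.count("nvlink"), "NVLink,",
          links.count("pcie"), "PCIe")
```

In this layout the traffic between the two shards of a model stays on NVLink, while gradient exchange between replicas still crosses PCIe. Without any bridges, both kinds of traffic would go over PCIe, so the benefit of the bridges depends on how much of the communication your script keeps inside a pair.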