Model sharding, data parallelism and NVLink

In my group, we are interested in buying a server with 8 Nvidia A40 GPUs, split into 4 groups of 2, where each pair of GPUs is physically connected by an NVLink bridge.
I wonder how using 4 NVLink-bridged pairs of GPUs will affect data parallelism and model sharding. How would it differ from using the same 8 GPUs without any NVLink bridges between them?



This doesn’t look like a cuDNN-related question. We recommend posting it on the relevant platform to get better help.

cuDNN currently does not support actively partitioning computations across multiple GPUs.
Some DL frameworks might support those features, and you might also be able to modify your model training scripts to achieve this. If you are able to use those features, faster communication through NVLink will speed up transfers between the two GPUs in each bridged pair (compared to the slower PCIe). Traffic between GPUs in different pairs still goes over PCIe.
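
To make the pair-versus-all-eight distinction concrete, here is a rough back-of-the-envelope sketch using the standard bandwidth-only cost of a ring all-reduce, 2 * (N - 1) / N * bytes / bandwidth, where the slowest hop in the ring sets the bandwidth. Everything specific in it is an assumption for illustration: the bandwidth constants are placeholder per-direction figures (check the A40 datasheet or measure on your system), the 1 GB gradient size is hypothetical, and the helper names are made up. It also ignores latency, PCIe switch and CPU socket layout, and the fact that a communication library may pick a different algorithm than a single ring.

```python
def ring_bottleneck(ring, nvlink_pairs, nvlink_bps, pcie_bps):
    """Return the bandwidth of the slowest hop in a ring of GPU ids."""
    hops = zip(ring, ring[1:] + ring[:1])  # consecutive GPUs, wrapping around
    return min(
        nvlink_bps if frozenset(hop) in nvlink_pairs else pcie_bps
        for hop in hops
    )


def ring_allreduce_seconds(num_gpus, payload_bytes, link_bps):
    """Bandwidth-only estimate for a ring all-reduce (latency ignored)."""
    if num_gpus < 2:
        return 0.0
    return 2 * (num_gpus - 1) / num_gpus * payload_bytes / link_bps


# Assumed, illustrative per-direction bandwidths in bytes/s (not measured).
NVLINK_BPS = 56e9
PCIE_BPS = 32e9
GRAD_BYTES = 1e9  # hypothetical 1 GB of gradients per step

# The proposed server: GPUs (0,1), (2,3), (4,5), (6,7) are bridged.
BRIDGED = {frozenset(p) for p in [(0, 1), (2, 3), (4, 5), (6, 7)]}
NO_BRIDGES = set()

all_gpus = list(range(8))
one_pair = [0, 1]

# Data parallelism over all 8 GPUs: the ring has to cross PCIe either way.
all8_bridged = ring_allreduce_seconds(
    8, GRAD_BYTES, ring_bottleneck(all_gpus, BRIDGED, NVLINK_BPS, PCIE_BPS))
all8_plain = ring_allreduce_seconds(
    8, GRAD_BYTES, ring_bottleneck(all_gpus, NO_BRIDGES, NVLINK_BPS, PCIE_BPS))

# Two-way sharding inside one pair: every hop can use the bridge.
pair_bridged = ring_allreduce_seconds(
    2, GRAD_BYTES, ring_bottleneck(one_pair, BRIDGED, NVLINK_BPS, PCIE_BPS))
pair_plain = ring_allreduce_seconds(
    2, GRAD_BYTES, ring_bottleneck(one_pair, NO_BRIDGES, NVLINK_BPS, PCIE_BPS))

print(f"8-GPU all-reduce, bridged pairs: {all8_bridged * 1e3:.1f} ms")
print(f"8-GPU all-reduce, no bridges:    {all8_plain * 1e3:.1f} ms")
print(f"2-GPU all-reduce, bridged pair:  {pair_bridged * 1e3:.1f} ms")
print(f"2-GPU all-reduce, no bridge:     {pair_plain * 1e3:.1f} ms")
```

Under this simplified model, the bridges help communication inside a pair (for example a model sharded across two GPUs), while a single ring over all 8 GPUs is limited by its PCIe hops with or without the bridges. On a real machine, `nvidia-smi topo -m` shows which GPU pairs are actually connected by NVLink.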