I read the following threads:
and also watched the following video:
Which led me to the following question:
In my group, we are interested in buying a server with 8 NVIDIA A40 GPUs, split into 4 groups of 2, where each pair of GPUs is physically connected by an NVLink bridge.
I wonder how having 4 NVLink-connected pairs will affect data parallelism and model sharding across the 8 GPUs. How would it differ from using the same 8 GPUs without any NVLink bridges between them?
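To make the setup concrete, here is a small sketch of the topology I have in mind. It is only a model, not a measurement. It assumes the bridged pairs are GPUs 0-1, 2-3, 4-5 and 6-7, and that any two GPUs not sharing a bridge talk over PCIe. The names `nvlink_group` and `link_type` are my own, purely for illustration.

```python
from itertools import combinations

NUM_GPUS = 8   # from the question: 8 A40 GPUs
PAIR_SIZE = 2  # each NVLink bridge joins 2 GPUs

def nvlink_group(gpu: int) -> int:
    """Index of the bridged pair a GPU belongs to (assumes pairs 0-1, 2-3, ...)."""
    return gpu // PAIR_SIZE

def link_type(a: int, b: int) -> str:
    """'NVLink' if both GPUs share a bridge, otherwise 'PCIe' (assumed fallback)."""
    return "NVLink" if nvlink_group(a) == nvlink_group(b) else "PCIe"

# Classify every possible GPU-to-GPU connection.
links = {(a, b): link_type(a, b) for a, b in combinations(range(NUM_GPUS), 2)}
nvlink_links = [pair for pair, kind in links.items() if kind == "NVLink"]
pcie_links = [pair for pair, kind in links.items() if kind == "PCIe"]

# Hops of a simple ring 0 -> 1 -> ... -> 7 -> 0 spanning all 8 GPUs.
ring_hops = [link_type(i, (i + 1) % NUM_GPUS) for i in range(NUM_GPUS)]

print(len(nvlink_links), len(pcie_links))                  # 4 24
print(ring_hops.count("NVLink"), ring_hops.count("PCIe"))  # 4 4
```

Under these assumptions only 4 of the 28 possible GPU-to-GPU connections are NVLink, and a ring over all 8 GPUs still has to cross PCIe on half of its hops. So part of my question is whether the bridges mostly help 2-GPU model sharding inside a pair, rather than 8-way data parallelism.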