We have a physical compute node with 8 A800 GPUs, and the output of “nvidia-smi topo -m” on it is shown below:
However, when we created a virtual machine with 4 A800 GPUs on this node and ran a bandwidth benchmark on it, the results were as shown in the figure above: the bidirectional bandwidth was close to 100 GB/s and the unidirectional bandwidth was close to 50 GB/s.
This confuses me, because in the 4-GPU virtual machine each GPU has 4 NVLinks. Each NVLink supports 50 GB/s of bidirectional bandwidth, so the bidirectional (read and write) bandwidth of a single GPU should be close to 50 GB/s * 4 = 200 GB/s, and the unidirectional bandwidth should be close to 25 GB/s * 4 = 100 GB/s.
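To make the arithmetic explicit, here is a minimal Python sketch of my expectation versus what I measured. The per-link figure (50 GB/s bidirectional) and the link count (4 per GPU) are my own assumptions as stated above, and the measured values are rounded from the benchmark result; the variable and function names are just for illustration.

```python
# Assumptions taken from the post, not from a spec sheet:
LINK_BIDIR_GBPS = 50.0                    # assumed bidirectional bandwidth per NVLink
LINK_UNIDIR_GBPS = LINK_BIDIR_GBPS / 2    # assumed unidirectional bandwidth per NVLink
LINKS_PER_GPU = 4                         # NVLinks per GPU in the 4-GPU virtual machine


def expected_aggregate(links: int, per_link_gbps: float) -> float:
    """Aggregate bandwidth if every link is used at its full per-link rate."""
    return links * per_link_gbps


expected_bidir = expected_aggregate(LINKS_PER_GPU, LINK_BIDIR_GBPS)
expected_unidir = expected_aggregate(LINKS_PER_GPU, LINK_UNIDIR_GBPS)

# Approximate values observed in the benchmark (rounded).
measured_bidir = 100.0
measured_unidir = 50.0

print(f"bidirectional:  expected {expected_bidir:.0f} GB/s, measured ~{measured_bidir:.0f} GB/s "
      f"({measured_bidir / expected_bidir:.0%} of expected)")
print(f"unidirectional: expected {expected_unidir:.0f} GB/s, measured ~{measured_unidir:.0f} GB/s "
      f"({measured_unidir / expected_unidir:.0%} of expected)")
```

In both directions the measured value is about half of what I calculated, which is the gap I am asking about.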
Is there something wrong with my interpretation of aggregate bandwidth on the A800, or with my understanding of NVLink?