Can someone explain these bandwidth tests to me:
P2P Connectivity Matrix
D\D 0 1 2 3 4 5 6 7
0 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1
3 1 1 1 1 1 1 1 1
4 1 1 1 1 1 1 1 1
5 1 1 1 1 1 1 1 1
6 1 1 1 1 1 1 1 1
7 1 1 1 1 1 1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 590.51 17.75 17.74 17.74 21.66 21.50 21.47 21.57
1 17.72 592.08 17.80 17.68 21.58 21.53 21.53 21.65
2 17.73 17.77 593.88 17.70 21.48 21.55 21.54 21.62
3 17.73 17.76 17.75 591.86 21.70 21.58 21.53 21.59
4 21.74 21.71 21.78 21.64 591.41 18.10 18.11 18.01
5 21.78 21.81 21.78 21.66 18.12 591.18 18.18 18.06
6 21.74 21.81 21.77 21.76 18.18 18.12 591.41 18.08
7 21.68 21.74 21.84 21.79 18.15 18.15 18.19 591.41
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 592.08 52.67 2.07 1.62 18.60 18.60 18.60 18.60
1 52.76 593.66 1.96 1.78 18.60 18.60 18.60 18.60
2 2.07 1.59 591.41 52.77 18.60 18.60 18.57 18.60
3 2.27 1.65 52.78 593.20 18.60 18.58 18.58 18.60
4 18.60 18.60 18.60 18.60 591.63 52.71 2.07 1.55
5 18.60 18.53 18.50 18.49 52.74 594.11 2.10 1.78
6 18.60 18.60 18.60 18.60 2.07 1.78 592.98 52.75
7 18.48 18.52 18.41 18.52 1.66 1.61 52.74 592.75
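(These matrices are from the CUDA p2pBandwidthLatencyTest sample.) To make the numbers concrete, here is a minimal sketch of how I understand a single cell could be measured: a timed device-to-device copy from one source GPU to one destination GPU, with peer access optionally enabled. The buffer size, iteration count, and the use of cudaMemcpyPeerAsync are my own placeholders, not necessarily what the sample does internally:

```cpp
// Sketch: measure unidirectional copy bandwidth from GPU `src` to GPU `dst`.
// If peer access is available and enabled, the copy can go directly over
// PCIe/NVLink; otherwise the runtime stages it through host memory.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int src = 0, dst = 1;            // pick any pair from the matrix
    const size_t bytes = 256ull << 20;     // 256 MiB test buffer (placeholder)
    const int iters = 20;

    // Optionally enable direct peer access (the "P2P=Enabled" case).
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, src, dst);
    if (canAccess) {
        cudaSetDevice(src);
        cudaDeviceEnablePeerAccess(dst, 0);
    }

    void *srcBuf, *dstBuf;
    cudaSetDevice(src);
    cudaMalloc(&srcBuf, bytes);
    cudaSetDevice(dst);
    cudaMalloc(&dstBuf, bytes);

    // Time the copies with events on the source device's default stream.
    cudaSetDevice(src);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < iters; ++i)
        cudaMemcpyPeerAsync(dstBuf, dst, srcBuf, src, bytes, 0);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    double gbps = (double)bytes * iters / (ms * 1e-3) / 1e9;
    printf("GPU %d -> GPU %d: %.2f GB/s (peer access %s)\n",
           src, dst, gbps, canAccess ? "enabled" : "unavailable");
    return 0;
}
```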
We have a server with 8 RTX A6000 GPUs and 2 Xeon Gold 6348 CPUs in a Supermicro SYS-420GP-TNR; the GPUs are connected pairwise with NVLink.
I would assume that GPUs 0-3 are physically attached to the first CPU and GPUs 4-7 to the second, that the two groups communicate via UPI, and that each group sits behind PCIe 4.0 x16 switches in a dual-root setup.
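To verify that assumption, I figure I could query the CUDA runtime for peer-access support and the P2P performance rank between every pair (in addition to looking at nvidia-smi topo -m). This is just a rough sketch of my own, not output I already have:

```cpp
// Sketch: list which GPU pairs report peer (P2P) access and their relative
// performance rank, to see whether only the NVLink-bridged pairs stand out.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int access = 0, rank = 0;
            cudaDeviceCanAccessPeer(&access, i, j);
            cudaDeviceGetP2PAttribute(&rank, cudaDevP2PAttrPerformanceRank, i, j);
            printf("GPU %d -> GPU %d: access=%d perfRank=%d\n", i, j, access, rank);
        }
    }
    return 0;
}
```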
Can someone explain why, in the "Unidirectional P2P=Disabled" matrix, the bandwidth from one group to the other is higher than the bandwidth within a group?
And even more puzzling, in the "Unidirectional P2P=Enabled" matrix, access to the NVLink peer is now fast, but access to the other two GPUs in the same group is extremely slow, while access to the other group is okay again.
What am I missing here?
Thanks in advance!