P2P Bandwidth measurements

Can someone explain these bandwidth tests to me:

P2P Connectivity Matrix
   D\D       0       1       2       3       4       5       6       7
     0       1       1       1       1       1       1       1       1
     1       1       1       1       1       1       1       1       1
     2       1       1       1       1       1       1       1       1
     3       1       1       1       1       1       1       1       1
     4       1       1       1       1       1       1       1       1
     5       1       1       1       1       1       1       1       1
     6       1       1       1       1       1       1       1       1
     7       1       1       1       1       1       1       1       1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D       0       1       2       3       4       5       6       7
     0  590.51   17.75   17.74   17.74   21.66   21.50   21.47   21.57
     1   17.72  592.08   17.80   17.68   21.58   21.53   21.53   21.65
     2   17.73   17.77  593.88   17.70   21.48   21.55   21.54   21.62
     3   17.73   17.76   17.75  591.86   21.70   21.58   21.53   21.59
     4   21.74   21.71   21.78   21.64  591.41   18.10   18.11   18.01
     5   21.78   21.81   21.78   21.66   18.12  591.18   18.18   18.06
     6   21.74   21.81   21.77   21.76   18.18   18.12  591.41   18.08
     7   21.68   21.74   21.84   21.79   18.15   18.15   18.19  591.41
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D       0       1       2       3       4       5       6       7
     0  592.08   52.67    2.07    1.62   18.60   18.60   18.60   18.60
     1   52.76  593.66    1.96    1.78   18.60   18.60   18.60   18.60
     2    2.07    1.59  591.41   52.77   18.60   18.60   18.57   18.60
     3    2.27    1.65   52.78  593.20   18.60   18.58   18.58   18.60
     4   18.60   18.60   18.60   18.60  591.63   52.71    2.07    1.55
     5   18.60   18.53   18.50   18.49   52.74  594.11    2.10    1.78
     6   18.60   18.60   18.60   18.60    2.07    1.78  592.98   52.75
     7   18.48   18.52   18.41   18.52    1.66    1.61   52.74  592.75

We have a SuperMicro SYS-420GP-TNR server with 8 RTX A6000 GPUs and 2 Xeon Gold 6348 CPUs; the GPUs are connected pairwise with NVLink.
I would assume that GPUs 0-3 are physically attached to the first CPU and GPUs 4-7 to the second, so the two groups communicate via UPI, with a PCIe 4.0 x16 Switch Dual-Root setup.
Can someone explain why, in the “Unidirectional P2P=Disabled” matrix, the bandwidth is higher from one group to the other but lower within a group?
And even more puzzling, in the “Unidirectional P2P=Enabled” matrix, bandwidth to the NVLinked peer is now high, but to the other two GPUs in the same group it is extremely slow, while to the other group it is okay again.

What am I missing here?
Thanks in advance!
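
For context, this is roughly what I understand the “P2P=Enabled” case to be doing, as a minimal sketch; the device pair (0 -> 1, one of the NVLinked pairs) and the 64 MiB buffer size are placeholders I picked, not what the test itself uses:

// Minimal sketch of a P2P-enabled device-to-device copy, timed with CUDA events.
// Device pair and buffer size are placeholders.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int src = 0, dst = 1;            // placeholder device pair
    const size_t bytes = 64ull << 20;      // placeholder 64 MiB test buffer

    void *srcBuf = nullptr, *dstBuf = nullptr;
    cudaSetDevice(src);  cudaMalloc(&srcBuf, bytes);
    cudaSetDevice(dst);  cudaMalloc(&dstBuf, bytes);

    // "P2P=Enabled": let each device map the other's memory so the copy can go
    // directly over NVLink/PCIe instead of being staged through host memory.
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, dst, src);   // can dst map src's memory?
    if (canAccess) { cudaSetDevice(dst); cudaDeviceEnablePeerAccess(src, 0); }
    cudaDeviceCanAccessPeer(&canAccess, src, dst);
    if (canAccess) { cudaSetDevice(src); cudaDeviceEnablePeerAccess(dst, 0); }

    // Time one large device-to-device copy.
    cudaSetDevice(dst);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);  cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpyPeerAsync(dstBuf, dst, srcBuf, src, bytes);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%d -> %d: %.2f GB/s\n", src, dst, (bytes / 1e9) / (ms / 1e3));
    return 0;
}

My understanding is that without cudaDeviceEnablePeerAccess the same copy is staged through host memory, which is what the “P2P=Disabled” numbers should reflect.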

What I forgot to mention: the decreased bandwidth seems to be caused by incredibly high latency, as can be seen here:

P2P=Disabled Latency Matrix (us)
   GPU     0      1      2      3      4      5      6      7
     0   1.71  19.28  12.22  18.24  11.49  13.47  11.66  11.60
     1  13.62   1.70  13.67  17.05  11.43  11.93  12.61  12.40
     2  20.54  11.44   1.63  15.50  12.10  11.45  11.61  11.49
     3  14.25  14.22  16.88   1.70  14.11  12.83  12.36  12.36
     4  15.77  12.37  16.41  14.52   1.55  20.22  20.53  20.51
     5  14.54  12.60  17.20  16.15  20.10   1.61  11.28  20.54
     6  12.36  17.43  12.44  12.53  11.97  11.47   1.50  20.53
     7  12.54  17.23  13.30  12.51  11.62  11.26  20.53   1.53

   CPU     0      1      2      3      4      5      6      7
     0   2.65   8.51   8.33   9.65   8.88   7.65   7.69   7.65
     1   7.47   2.66   7.38   8.39   7.79   6.58   6.54   6.57
     2   7.31   7.27   2.64   8.31   7.76   6.50   6.48   6.50
     3   8.56   8.36   8.25   2.75   8.76   7.61   7.56   7.56
     4   8.00   7.85   7.83   9.05   3.17   7.16   7.15   7.15
     5   6.89   6.75   6.74   7.94   7.32   2.43   6.14   6.14
     6   6.85   6.77   6.72   7.95   7.34   6.13   2.43   6.15
     7   6.90   6.77   6.78   8.01   7.41   6.17   6.14   2.39
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2      3      4      5      6      7
     0   1.71   1.59 49204.81 49204.89   1.77   1.77   1.72   1.76
     1   1.55   1.70 49204.72 49204.70   2.33   2.36   2.37   2.35
     2 49204.73 49204.73   1.51   1.52   2.36   2.36   2.36   2.41
     3 49204.57 49204.53   1.58   1.61   2.38   2.36   2.35   2.39
     4   2.30   2.33   1.82   1.82   1.78   1.68 49204.80 49204.72
     5   2.36   2.31   2.31   1.80   1.64   1.73 49204.43 49204.43
     6   2.33   2.30   2.33   2.32 49204.79 49204.81   1.51   1.57
     7   2.34   2.31   2.31   2.33 49204.69 49204.72   1.57   1.54

   CPU     0      1      2      3      4      5      6      7
     0   2.63   2.36   2.47   2.51   2.61   2.51   2.43   2.43
     1   2.03   2.68   2.12   2.17   2.10   2.07   2.07   2.06
     2   2.02   1.99   2.77   2.02   2.08   2.06   2.09   2.06
     3   2.48   2.40   2.45   2.71   2.46   2.48   2.46   2.48
     4   2.23   2.14   2.15   2.15   3.09   2.20   2.27   2.29
     5   2.01   1.87   1.79   1.78   1.84   2.48   1.95   1.95
     6   1.83   1.77   1.76   1.80   1.74   1.91   2.61   1.86
     7   1.84   1.78   1.79   1.80   1.78   1.85   1.89   2.48

Why would enabling P2P increase the latency to the other GPUs in the same group by that much?
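
If it helps, this is roughly how I would approximate those per-transfer latencies myself: average a burst of tiny peer copies. This is not the exact method the test uses, and the 0 -> 2 pair (one of the slow in-group pairs), the repetition count, and the 4-byte copy size are just placeholders:

// Rough latency estimate: time many tiny peer copies back to back and average.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int src = 0, dst = 2;      // placeholder: a non-NVLinked in-group pair
    const int reps = 100;            // placeholder repetition count
    const size_t bytes = 4;          // tiny copy, so the time is roughly latency

    void *srcBuf = nullptr, *dstBuf = nullptr;
    cudaSetDevice(src);  cudaMalloc(&srcBuf, bytes);
    cudaSetDevice(dst);  cudaMalloc(&dstBuf, bytes);

    // Enable peer access in both directions (I believe the test does the same).
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, src, dst);
    if (canAccess) { cudaSetDevice(src); cudaDeviceEnablePeerAccess(dst, 0); }
    cudaDeviceCanAccessPeer(&canAccess, dst, src);
    if (canAccess) { cudaSetDevice(dst); cudaDeviceEnablePeerAccess(src, 0); }

    cudaSetDevice(src);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);  cudaEventCreate(&stop);

    // Warm up once, then time the burst.
    cudaMemcpyPeerAsync(dstBuf, dst, srcBuf, src, bytes);
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i)
        cudaMemcpyPeerAsync(dstBuf, dst, srcBuf, src, bytes);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%d -> %d: avg %.2f us per copy\n", src, dst, ms * 1000.0f / reps);
    return 0;
}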

nvidia-smi reports the following topology:

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV4     PXB     PXB     SYS     SYS     SYS     SYS     0-27,56-83      0               N/A
GPU1    NV4      X      PXB     PXB     SYS     SYS     SYS     SYS     0-27,56-83      0               N/A
GPU2    PXB     PXB      X      NV4     SYS     SYS     SYS     SYS     0-27,56-83      0               N/A
GPU3    PXB     PXB     NV4      X      SYS     SYS     SYS     SYS     0-27,56-83      0               N/A
GPU4    SYS     SYS     SYS     SYS      X      NV4     PXB     PXB     28-55,84-111    1               N/A
GPU5    SYS     SYS     SYS     SYS     NV4      X      PXB     PXB     28-55,84-111    1               N/A
GPU6    SYS     SYS     SYS     SYS     PXB     PXB      X      NV4     28-55,84-111    1               N/A
GPU7    SYS     SYS     SYS     SYS     PXB     PXB     NV4      X      28-55,84-111    1               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
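
As a cross-check against that topology, here is a small sketch that asks the CUDA runtime which pairs report peer capability and what relative performance rank it assigns each link (the output format is my own, and I am not sure how the rank values should be interpreted beyond being relative):

// Print, for every device pair, the peer-access capability bit and the
// runtime's relative P2P performance rank for that link.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) { printf("   -  "); continue; }
            int access = 0, rank = 0;
            cudaDeviceCanAccessPeer(&access, i, j);
            cudaDeviceGetP2PAttribute(&rank, cudaDevP2PAttrPerformanceRank, i, j);
            printf(" %d/%d  ", access, rank);   // capability / perf rank
        }
        printf("\n");
    }
    return 0;
}

I would expect the capability bits to mirror the all-ones connectivity matrix at the top, which is why the topology report alone does not explain the slow in-group pairs to me.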