Note: I created this topic here because the NVIDIA forum does not let me create topics in other categories. It really belongs under DOCA or Network Adapters.
I have a server with two NUMA nodes running RHEL 8.8, with no OS virtualization, no virtual machines, and no Docker containers.
When I run the following command:
sudo lspci -t -vvv | grep -EB1 "NVIDIA|Mellanox|Switch"
| +-01.0-[c0]----00.0 Broadcom / LSI Virtual PCIe Placeholder Endpoint
| +-02.0-[c1]--+-00.0 Mellanox Technologies MT2910 Family [ConnectX-7]
| | \-00.1 Mellanox Technologies MT2910 Family [ConnectX-7]
| +-03.0-[c2]----00.0 NVIDIA Corporation Device 26b9
| +-04.0-[c3]----00.0 NVIDIA Corporation Device 26b9
| \-1f.0-[c4]----00.0 Broadcom / LSI PCIe Switch management endpoint
--
| +-02.0-[af]----00.0 Broadcom / LSI Virtual PCIe Placeholder Endpoint
| +-03.0-[b0]----00.0 NVIDIA Corporation Device 26b9
| \-04.0-[b1]----00.0 NVIDIA Corporation Device 26b9
--
| +-00.4 Intel Corporation Device 0b23
| \-01.0-[99]--+-00.0 Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
| \-00.1 Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
--
| +-00.4 Intel Corporation Device 0b23
| \-01.0-[3d-44]----00.0-[3e-44]--+-00.0-[3f]----00.0 NVIDIA Corporation Device 26b9
| +-01.0-[40]----00.0 NVIDIA Corporation Device 26b9
| +-02.0-[41]----00.0 NVIDIA Corporation Device 26b9
--
| +-04.0-[43]----00.0 Broadcom / LSI Virtual PCIe Placeholder Endpoint
| \-1f.0-[44]----00.0 Broadcom / LSI PCIe Switch management endpoint
--
| +-01.0-[2d]----00.0 Broadcom / LSI Virtual PCIe Placeholder Endpoint
| +-02.0-[2e]--+-00.0 Mellanox Technologies MT2910 Family [ConnectX-7]
| | \-00.1 Mellanox Technologies MT2910 Family [ConnectX-7]
| +-03.0-[2f]----00.0 Broadcom / LSI Virtual PCIe Placeholder Endpoint
| \-04.0-[30]----00.0 NVIDIA Corporation Device 26b9
The server previously had DOCA 2.7 installed. After updating to DOCA 3.0, we see the “Broadcom / LSI Virtual PCIe Placeholder Endpoint” entries shown above.
The output of cat /proc/cmdline contains iommu=off intel_iommu=off.
SR-IOV should be disabled on the system.
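To confirm at runtime that SR-IOV is off rather than assume it, one option is to read the sriov_numvfs attribute that the Linux kernel exposes in sysfs for SR-IOV-capable PCI functions. This is a minimal Python sketch, assuming the standard /sys/bus/pci/devices layout. The function name active_vfs is made up for illustration and is not part of DOCA or any NVIDIA tool.

```python
from pathlib import Path


def active_vfs(sysfs_root="/sys/bus/pci/devices"):
    """Map PCI address -> number of currently enabled virtual functions.

    Only functions that expose sriov_numvfs (SR-IOV capable ones) are
    reported. A value of 0 means no VFs are instantiated right now.
    """
    result = {}
    for attr in sorted(Path(sysfs_root).glob("*/sriov_numvfs")):
        try:
            result[attr.parent.name] = int(attr.read_text().strip())
        except (OSError, ValueError):
            # Unreadable or malformed attribute: skip it rather than guess.
            continue
    return result
```

Calling active_vfs() on the host and printing the result lists every SR-IOV-capable function with its current VF count. Note that this reports only the runtime state. It does not show the firmware settings (SRIOV_EN, NUM_OF_VFS), which would still need to be queried with mlxconfig.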
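The above server has major performance issues when crossing NUMA nodes. In the P2P bandwidth results below, transfers between device 0 and devices 4-7 drop to roughly 3.5 GB/s unidirectional (about 7 GB/s bidirectional), while the other cross-node pairs reach about 22.8 GB/s (about 43.9 GB/s bidirectional). Row 0 is marked with <<<.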
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 652.13 26.40 26.40 26.40 3.73 3.55 3.59 3.46 <<<<<<<<<<<<<<<<<<<
1 26.40 676.11 26.40 26.40 22.82 22.82 22.81 22.82
2 26.40 26.40 674.07 26.40 22.82 22.82 22.81 22.81
3 26.40 26.40 26.40 677.58 22.82 22.81 22.81 22.81
4 22.81 22.82 22.81 22.82 675.24 26.40 26.40 26.40
5 22.81 22.81 22.82 22.82 26.40 676.70 26.40 26.40
6 22.81 22.81 22.81 22.82 26.40 26.40 677.87 26.40
7 22.82 22.82 22.81 22.81 26.40 26.40 26.40 677.58
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 652.26 52.18 52.17 52.18 7.16 7.59 7.00 7.44 <<<<<<<<<<<<<<<<<<<
1 52.18 647.92 52.20 52.20 43.92 43.92 43.92 43.92
2 52.19 52.18 651.18 52.19 43.91 43.91 43.91 43.92
3 52.19 52.19 52.19 653.63 43.93 43.93 43.91 43.93
4 7.58 43.92 43.91 43.93 650.09 52.19 52.19 52.19
5 7.37 43.92 43.91 43.93 52.18 651.45 52.19 52.19
6 7.31 43.92 43.92 43.92 52.19 52.18 650.77 52.20
7 6.85 43.92 43.91 43.93 52.20 52.20 52.18 650.91
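The slow pairs in these matrices can also be picked out programmatically, which helps when comparing runs before and after a configuration change. This is a rough Python sketch using the unidirectional matrix above as input. The helper names and the 0.5 threshold are arbitrary choices for illustration, not part of any benchmark tool.

```python
from statistics import median

# Unidirectional P2P write bandwidth (GB/s), copied from the matrix above
# with the header row, row labels and the <<< marker removed.
UNIDIR = """\
652.13 26.40 26.40 26.40 3.73 3.55 3.59 3.46
26.40 676.11 26.40 26.40 22.82 22.82 22.81 22.82
26.40 26.40 674.07 26.40 22.82 22.82 22.81 22.81
26.40 26.40 26.40 677.58 22.82 22.81 22.81 22.81
22.81 22.82 22.81 22.82 675.24 26.40 26.40 26.40
22.81 22.81 22.82 22.82 26.40 676.70 26.40 26.40
22.81 22.81 22.81 22.82 26.40 26.40 677.87 26.40
22.82 22.82 22.81 22.81 26.40 26.40 26.40 677.58
"""


def parse_matrix(text):
    """Turn whitespace-separated rows into a list of float lists."""
    return [[float(x) for x in line.split()]
            for line in text.strip().splitlines()]


def slow_pairs(matrix, ratio=0.5):
    """Return (row, col, GB/s) for off-diagonal cells that fall below
    ratio * median of all off-diagonal cells."""
    off_diag = [v for i, row in enumerate(matrix)
                for j, v in enumerate(row) if i != j]
    cutoff = ratio * median(off_diag)
    return [(i, j, v) for i, row in enumerate(matrix)
            for j, v in enumerate(row) if i != j and v < cutoff]


for row, col, bw in slow_pairs(parse_matrix(UNIDIR)):
    print(f"device {row} -> device {col}: {bw} GB/s")
```

On the unidirectional data this flags only row 0, columns 4 to 7. The diagonal is excluded because it measures on-device bandwidth rather than P2P.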
We don’t see this issue on a second server. It has the same hardware and, as far as we can tell, the same software configuration (RHEL 8.8 and DOCA 3.0). Its output for the same command:
sudo lspci -t -vvv | grep -EB1 "NVIDIA|Mellanox|Switch"
| +-00.4 Intel Corporation Device 0b23
| \-01.0-[bd-c4]----00.0-[be-c4]--+-00.0-[bf]--+-00.0 Mellanox Technologies MT2910 Family [ConnectX-7]
| | \-00.1 Mellanox Technologies MT2910 Family [ConnectX-7]
| +-01.0-[c0]----00.0 NVIDIA Corporation Device 26b9
| +-02.0-[c1]----00.0 NVIDIA Corporation Device 26b9
--
| +-04.0-[c3]--
| \-1f.0-[c4]----00.0 Broadcom / LSI PCIe Switch management endpoint
--
| +-00.4 Intel Corporation Device 0b23
| \-01.0-[ab-b1]----00.0-[ac-b1]--+-00.0-[ad]----00.0 NVIDIA Corporation Device 26b9
| +-01.0-[ae]----00.0 NVIDIA Corporation Device 26b9
--
| +-00.4 Intel Corporation Device 0b23
| \-01.0-[99]--+-00.0 Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
| \-00.1 Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
--
| +-00.4 Intel Corporation Device 0b23
| \-01.0-[43-4a]----00.0-[44-4a]--+-00.0-[45]----00.0 NVIDIA Corporation Device 26b9
| +-01.0-[46]----00.0 NVIDIA Corporation Device 26b9
--
| +-04.0-[49]--
| \-1f.0-[4a]----00.0 Broadcom / LSI PCIe Switch management endpoint
--
| +-00.4 Intel Corporation Device 0b23
| \-01.0-[1e-24]----00.0-[1f-24]--+-00.0-[20]----00.0 NVIDIA Corporation Device 26b9
| +-01.0-[21]----00.0 NVIDIA Corporation Device 26b9
| +-02.0-[22]--+-00.0 Mellanox Technologies MT2910 Family [ConnectX-7]
| | \-00.1 Mellanox Technologies MT2910 Family [ConnectX-7]
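To compare the two servers without reading the trees by eye, the device descriptions in each lspci output can be tallied and diffed. This is a minimal Python sketch. The function name device_counts and the regular expression are my own, and the two sample strings are short excerpts from the outputs above, not the full listings.

```python
import re
from collections import Counter

# Greedy ".*" makes the match land on the LAST "dev.fn description" pair
# in a line, i.e. the endpoint at the end of the bridge chain.
_ENDPOINT = re.compile(r".*\b[0-9a-f]{2}\.[0-7] +(.+)$")


def device_counts(lspci_tree_text):
    """Count endpoint descriptions in `lspci -t -v` style output."""
    counts = Counter()
    for line in lspci_tree_text.splitlines():
        m = _ENDPOINT.match(line)
        if m:
            counts[m.group(1).strip()] += 1
    return counts


# Excerpt from the first (slow) server.
SERVER_A = r"""
| +-01.0-[c0]----00.0 Broadcom / LSI Virtual PCIe Placeholder Endpoint
| +-02.0-[c1]--+-00.0 Mellanox Technologies MT2910 Family [ConnectX-7]
| | \-00.1 Mellanox Technologies MT2910 Family [ConnectX-7]
| +-03.0-[c2]----00.0 NVIDIA Corporation Device 26b9
"""

# Excerpt from the second (good) server.
SERVER_B = r"""
| +-04.0-[c3]--
| \-1f.0-[c4]----00.0 Broadcom / LSI PCIe Switch management endpoint
"""

a, b = device_counts(SERVER_A), device_counts(SERVER_B)
# Counter subtraction keeps only descriptions that A has more of than B.
for desc, extra in (a - b).items():
    print(f"only on / more on server A: {extra} x {desc}")
```

Saving the complete command output from each server to a file and passing the contents to device_counts gives a full diff. Empty bridge lines such as "04.0-[c3]--" are ignored because no endpoint is attached.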
My questions are as follows:
Q1. When using DOCA 3.0 with ConnectX-7, is Single Root I/O Virtualization (SR-IOV) required for ALL use cases (doca profile -all), even when the host OS has no virtualization and no VMs?
Q2. Similarly, when using DOCA 3.0 with ConnectX-7 and no virtualization/VMs, should I run mlxconfig -d <device_name> set SRIOV_EN=0 NUM_OF_VFS=0?
Q3. On a system with no virtualization and no VMs, what performance impact, if any, does disabling SR-IOV have?
Q4. If SR-IOV is not required, how do I make sure the “Broadcom / LSI Virtual PCIe Placeholder Endpoint” devices are disabled and no longer appear?
Q5. Can you please provide a document with steps to cleanly uninstall any previous versions of DOCA? (In my case, I moved from DOCA 2.7 to 3.0.)