Use RoCE with OVS hardware offload

Hello everyone,

I have a small problem configuring my ConnectX-5 Ex cards correctly.

The issue is the following:

I would like to use RoCE and OVS with hardware offload in parallel on my cards.
For this I have to change the devlink eswitch mode to switchdev according to the instructions, but after that RoCE no longer works.
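For reference, this is how I check and change the eswitch mode (the PCI addresses are the ones from my setup below; as far as I understand devlink, the show command just prints the current mode):

# devlink dev eswitch show pci/0000:01:00.0
# devlink dev eswitch set pci/0000:01:00.0 mode switchdev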

General:

# ethtool -i enp1s0f0np0 | head -5
driver: mlx5_core
version: 5.8-2.0.3
firmware-version: 16.35.2000 (MT_0000000013)
expansion-rom-version: 
bus-info: 0000:01:00.0
# ethtool -i enp1s0f1np1 | head -5
driver: mlx5_core
version: 5.8-2.0.3
firmware-version: 16.35.2000 (MT_0000000013)
expansion-rom-version: 
bus-info: 0000:01:00.1

Network config:

auto enp1s0f0np0
iface enp1s0f0np0 inet manual

auto enp1s0f0np0.2
iface enp1s0f0np0.2 inet static
    address  10.15.15.1/24
    up ip route add 10.15.15.3/32 dev enp1s0f0np0.2
    down ip route del 10.15.15.3/32

auto enp1s0f1np1
iface enp1s0f1np1 inet manual

auto enp1s0f1np1.2
iface enp1s0f1np1.2 inet static
    address  10.15.15.1/24
    up ip route add 10.15.15.2/32 dev enp1s0f1np1.2
    down ip route del 10.15.15.2/32
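Just for completeness, this is how I check that the VLAN interfaces and the host routes come up as intended (nothing special, plain iproute2):

# ip -br addr show enp1s0f0np0.2
# ip route get 10.15.15.3

The second command should show the route going out via enp1s0f0np0.2.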

systemd service

This service is only enabled for the "After" state shown below.

# cat  /etc/systemd/system/mlxn_ofed.service
[Unit]
Description=Configure nvidia connectx network card
Before=network.target
After=openibd.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=bash -c "mlnx_qos -i enp1s0f0np0 --pfc 0,0,0,0,1,0,0,0"
ExecStart=bash -c "echo 2 > /sys/class/net/enp1s0f0np0/device/sriov_numvfs"
ExecStart=bash -c "echo 0000:01:00.2 > /sys/bus/pci/drivers/mlx5_core/unbind"
ExecStart=bash -c "echo 0000:01:00.3 > /sys/bus/pci/drivers/mlx5_core/unbind"
ExecStart=bash -c "devlink dev eswitch set pci/0000:01:00.0 mode switchdev"
ExecStart=bash -c "echo 0000:01:00.2 > /sys/bus/pci/drivers/mlx5_core/bind"
ExecStart=bash -c "echo 0000:01:00.3 > /sys/bus/pci/drivers/mlx5_core/bind"
ExecStart=bash -c "mlnx_qos -i enp1s0f1np1 --pfc 0,0,0,0,1,0,0,0"
ExecStart=bash -c "echo 2 > /sys/class/net/enp1s0f1np1/device/sriov_numvfs"
ExecStart=bash -c "echo 0000:01:01.2 > /sys/bus/pci/drivers/mlx5_core/unbind"
ExecStart=bash -c "echo 0000:01:01.3 > /sys/bus/pci/drivers/mlx5_core/unbind"
ExecStart=bash -c "devlink dev eswitch set pci/0000:01:00.1 mode switchdev"
ExecStart=bash -c "echo 0000:01:01.2 > /sys/bus/pci/drivers/mlx5_core/bind"
ExecStart=bash -c "echo 0000:01:01.3 > /sys/bus/pci/drivers/mlx5_core/bind"

[Install]
WantedBy=multi-user.target
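The service is enabled and checked the usual way; as far as I know, mlnx_qos without further options prints the current QoS/PFC configuration, so I use it to confirm the PFC setting took effect:

# systemctl enable --now mlxn_ofed.service
# systemctl status mlxn_ofed.service
# mlnx_qos -i enp1s0f0np0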

Test

Before:

# show_gids
DEV	PORT	INDEX	GID					IPv4  		VER	DEV
---	----	-----	---					------------  	---	---
mlx5_0	1	0	< removed >								v1	enp1s0f0np0
mlx5_0	1	1	< removed >								v2	enp1s0f0np0
mlx5_0	1	2	< removed >				10.15.15.1  	v1	enp1s0f0np0.2
mlx5_0	1	3	< removed >				10.15.15.1  	v2	enp1s0f0np0.2
mlx5_0	1	4	< removed >								v1	enp1s0f0np0.2
mlx5_0	1	5	< removed >								v2	enp1s0f0np0.2
mlx5_1	1	0	< removed >								v1	enp1s0f1np1
mlx5_1	1	1	< removed >								v2	enp1s0f1np1
mlx5_1	1	2	< removed >				10.15.15.1  	v1	enp1s0f1np1.2
mlx5_1	1	3	< removed >				10.15.15.1  	v2	enp1s0f1np1.2

# lspci
01:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
01:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
# ip link
5: enp1s0f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether < removed > brd ff:ff:ff:ff:ff:ff
6: enp1s0f1np1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
    link/ether < removed > brd ff:ff:ff:ff:ff:ff
13: enp1s0f0np0.2@enp1s0f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether < removed > brd ff:ff:ff:ff:ff:ff
14: enp1s0f1np1.2@enp1s0f1np1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state LOWERLAYERDOWN mode DEFAULT group default qlen 1000
    link/ether < removed > brd ff:ff:ff:ff:ff:ff
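For completeness, this is roughly how I test RoCE in the "Before" state (assuming the perftest tools are installed), using the GID index of the RoCEv2 entry on the VLAN interface from show_gids (index 3 here):

# ib_write_bw -d mlx5_0 -x 3                 (on this node, as server)
# ib_write_bw -d mlx5_0 -x 3 10.15.15.1      (on the peer node, as client)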

After:

# show_gids
DEV	PORT	INDEX	GID					IPv4  		VER	DEV
---	----	-----	---					------------  	---	---
mlx5_2	1	0	< removed >								v1	enp1s0f0v0
mlx5_2	1	1	< removed >								v2	enp1s0f0v0
mlx5_3	1	0	< removed >								v1	enp1s0f0v1
mlx5_3	1	1	< removed >								v2	enp1s0f0v1
mlx5_4	1	0	< removed >								v1	enp1s0f1v0
mlx5_4	1	1	< removed >								v2	enp1s0f1v0
mlx5_5	1	0	< removed >								v1	enp1s0f1v1
mlx5_5	1	1	< removed >								v2	enp1s0f1v1
# lspci
01:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
01:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
01:00.2 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
01:00.3 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
01:01.2 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
01:01.3 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]

# ip link
5: enp1s0f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether < removed > brd ff:ff:ff:ff:ff:ff
    vf 0     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off
    vf 1     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off
6: enp1s0f1np1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
    link/ether < removed > brd ff:ff:ff:ff:ff:ff
    vf 0     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off
    vf 1     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off
15: enp1s0f0npf0vf0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether < removed > brd ff:ff:ff:ff:ff:ff
16: enp1s0f0npf0vf1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether < removed > brd ff:ff:ff:ff:ff:ff
17: enp1s0f0v0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether < removed > brd ff:ff:ff:ff:ff:ff permaddr < removed >
18: enp1s0f0v1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether < removed > brd ff:ff:ff:ff:ff:ff permaddr < removed >
21: enp1s0f0np0.2@enp1s0f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether < removed > brd ff:ff:ff:ff:ff:ff
22: enp1s0f1np1.2@enp1s0f1np1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state LOWERLAYERDOWN mode DEFAULT group default qlen 1000
    link/ether < removed > brd ff:ff:ff:ff:ff:ff
23: enp1s0f1npf1vf0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether < removed > brd ff:ff:ff:ff:ff:ff
24: enp1s0f1npf1vf1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether < removed > brd ff:ff:ff:ff:ff:ff
25: enp1s0f1v0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether < removed > brd ff:ff:ff:ff:ff:ff permaddr < removed >
26: enp1s0f1v1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether < removed > brd ff:ff:ff:ff:ff:ff permaddr < removed >
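After the switch to switchdev mode I also checked which RDMA devices are still present; in show_gids only the VF devices (mlx5_2 to mlx5_5) appear, while the PF devices mlx5_0/mlx5_1 are gone. This is how I usually list them:

# rdma link show
# ibv_devices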

Host Infos

My host is a Proxmox server with kernel 5.19, and I have installed the mlnx-ofed-all package using https://linux.mellanox.com/public/repo/mlnx_ofed/5.8-2.0.3.0/debian11.3/mellanox_mlnx_ofed.list.
In the end, a mesh network with 3 servers is planned, hence the explicit host routes in the network config.

Links

I am following this link:

If you need any more information, please just let me know.

Many thanks in advance for your help.

Kind regards

Dear kay5,

Thank you for reaching out to Nvidia technical support.

In relation to ConnectX-5 Ex, it is worth mentioning that switchdev mode and offloading are supported.

Please consider the following points for review:

  1. Ensure that different NICs are not assigned the same IP address, as this could potentially cause issues. Please verify if this is intentional.
  2. It appears that the second port (enp1s0f1np1) is currently in NO-CARRIER mode. It is recommended to resolve the link issue if this port is being utilized.
  3. Refer to the provided guide for instructions on setting the MAC address for the VFs (see the brief example below).
  4. We are unable to comment on the OVS configuration without the OVS configuration/dump-flows/logs. Kindly provide these details for us to assess the OVS configuration accurately.
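As a brief illustration for point 3 only (please adjust the interface names and MAC addresses to your environment and follow the guide for the authoritative procedure), the VF MAC addresses are normally set on the PF via iproute2:

# ip link set enp1s0f0np0 vf 0 mac e4:11:22:33:44:50
# ip link set enp1s0f0np0 vf 1 mac e4:11:22:33:44:51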

Best regards,

Nvidia support

Dear ypetrov,

Thanks for your answer, and please excuse my late reply.

Regarding the points you noted:

  1. I have adjusted the network config for testing.
    I have changed the IP addresses, so please do not be surprised.
    Unfortunately, it did not change the result of show_gids,
    not even after I configured the MAC addresses as explained in point 3.
auto enp1s0f0np0
iface enp1s0f0np0 inet manual

#auto enp1s0f0np0.2
#iface enp1s0f0np0.2 inet static
#    address 172.19.189.163/28
#    up ip route add 172.19.189.162/32 dev enp1s0f0np0.2
#    down ip route del 172.19.189.162/32

auto enp1s0f1np1
iface enp1s0f1np1 inet manual

auto enp1s0f1np1.2
iface enp1s0f1np1.2 inet static
    address 172.19.189.163/28
    up ip route add 172.19.189.161/32 dev enp1s0f1np1.2
    down ip route del 172.19.189.161/32
  2. This was because only two nodes were online for this test; the third node is now online as well.
# ip link
5: enp1s0f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether < removed > brd ff:ff:ff:ff:ff:ff
    vf 0     link/ether e4:11:22:33:44:50 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off
    vf 1     link/ether e4:11:22:33:44:51 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off
6: enp1s0f1np1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether < removed > brd ff:ff:ff:ff:ff:ff
    vf 0     link/ether e4:11:22:33:44:60 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off
    vf 1     link/ether e4:11:22:33:44:61 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state disable, trust off, query_rss off
16: enp1s0f1np1.2@enp1s0f1np1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether < removed > brd ff:ff:ff:ff:ff:ff
19: enp1s0f0npf0vf0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether < removed > brd ff:ff:ff:ff:ff:ff
20: enp1s0f0npf0vf1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether < removed > brd ff:ff:ff:ff:ff:ff
21: enp1s0f0v0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether < removed > brd ff:ff:ff:ff:ff:ff permaddr < removed >
22: enp1s0f0v1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether < removed > brd ff:ff:ff:ff:ff:ff permaddr < removed >
25: enp1s0f1npf1vf0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether < removed > brd ff:ff:ff:ff:ff:ff
26: enp1s0f1npf1vf1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether < removed > brd ff:ff:ff:ff:ff:ff
27: enp1s0f1v0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether < removed > brd ff:ff:ff:ff:ff:ff permaddr < removed >
28: enp1s0f1v1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether < removed > brd ff:ff:ff:ff:ff:ff permaddr < removed >
  3. I have updated the script that configures the network card:
# cat  /etc/systemd/system/mlxn_ofed.service
[Unit]
Description=Configure nvidia connectx network card
Before=network.target
After=openibd.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=bash -c "mlnx_qos -i enp1s0f0np0 --pfc 0,0,0,0,1,0,0,0"
ExecStart=bash -c "echo 2 > /sys/class/net/enp1s0f0np0/device/sriov_numvfs"
ExecStart=bash -c "ip link set enp1s0f0np0 vf 0 mac e4:11:22:33:44:50"
ExecStart=bash -c "ip link set enp1s0f0np0 vf 1 mac e4:11:22:33:44:51"
ExecStart=bash -c "echo 0000:01:00.2 > /sys/bus/pci/drivers/mlx5_core/unbind"
ExecStart=bash -c "echo 0000:01:00.3 > /sys/bus/pci/drivers/mlx5_core/unbind"
ExecStart=bash -c "devlink dev eswitch set pci/0000:01:00.0 mode switchdev"
ExecStart=bash -c "echo 0000:01:00.2 > /sys/bus/pci/drivers/mlx5_core/bind"
ExecStart=bash -c "echo 0000:01:00.3 > /sys/bus/pci/drivers/mlx5_core/bind"
ExecStart=bash -c "mlnx_qos -i enp1s0f1np1 --pfc 0,0,0,0,1,0,0,0"
ExecStart=bash -c "echo 2 > /sys/class/net/enp1s0f1np1/device/sriov_numvfs"
ExecStart=bash -c "ip link set enp1s0f1np1 vf 0 mac e4:11:22:33:44:60"
ExecStart=bash -c "ip link set enp1s0f1np1 vf 1 mac e4:11:22:33:44:61"
ExecStart=bash -c "echo 0000:01:01.2 > /sys/bus/pci/drivers/mlx5_core/unbind"
ExecStart=bash -c "echo 0000:01:01.3 > /sys/bus/pci/drivers/mlx5_core/unbind"
ExecStart=bash -c "devlink dev eswitch set pci/0000:01:00.1 mode switchdev"
ExecStart=bash -c "echo 0000:01:01.2 > /sys/bus/pci/drivers/mlx5_core/bind"
ExecStart=bash -c "echo 0000:01:01.3 > /sys/bus/pci/drivers/mlx5_core/bind"

[Install]
WantedBy=multi-user.target
  4. I have not configured OVS yet, but I can give you some logs; I hope these are the right ones (see also my note after the logs):
# cat /var/log/openvswitch/ovs-ctl.log
Mon Jun 12 19:14:02 UTC 2023:load-kmod
Inserting psample module.
Inserting openvswitch module.
Mon Jun 12 19:14:02 UTC 2023:start --system-id=random
/etc/openvswitch/conf.db does not exist ... (warning).
Creating empty database /etc/openvswitch/conf.db.
Starting ovsdb-server.
Configuring Open vSwitch system IDs.
Starting ovs-vswitchd.
Enabling remote OVSDB managers.
# cat /var/log/openvswitch/ovs-vswitchd.log
2023-06-12T19:12:50.659Z|00002|daemon_unix(monitor)|INFO|pid 4890 died, exit status 0, exiting
2023-06-12T19:14:02.732Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2023-06-12T19:14:02.757Z|00002|ovs_numa|INFO|Discovered 12 CPU cores on NUMA node 0
2023-06-12T19:14:02.757Z|00003|ovs_numa|INFO|Discovered 1 NUMA nodes and 12 CPU cores
2023-06-12T19:14:02.757Z|00004|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2023-06-12T19:14:02.757Z|00005|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2023-06-12T19:14:02.757Z|00006|netdev_offload|INFO|netdev: Flow API Disabled. Sub-offload configurations are ignored.
2023-06-12T19:14:02.757Z|00007|dpdk|INFO|DPDK Disabled - Use other_config:dpdk-init to enable
2023-06-12T19:14:02.759Z|00008|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.17.7-e054917
2023-06-12T19:14:17.539Z|00009|memory|INFO|155004 kB peak resident set size after 14.8 seconds
2023-06-12T19:14:17.539Z|00010|memory|INFO|idl-cells:19
# cat /var/log/openvswitch/ovsdb-server.log
2023-06-12T19:12:50.764Z|00002|daemon_unix(monitor)|INFO|pid 4775 died, exit status 0, exiting
2023-06-12T19:14:02.593Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovsdb-server.log
2023-06-12T19:14:02.617Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.17.7
2023-06-12T19:14:12.628Z|00003|memory|INFO|18436 kB peak resident set size after 10.0 seconds
2023-06-12T19:14:12.628Z|00004|memory|INFO|atoms:48 cells:41 monitors:2 sessions:1
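As mentioned in point 4, OVS itself is not configured yet. From the documentation I understand that the hardware offload would eventually be enabled roughly like this; I have not applied it yet, so this is only my understanding:

# ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
# systemctl restart openvswitch-switch
# ovs-vsctl get Open_vSwitch . other_config:hw-offload

The last command should then return "true", and if I understand correctly the "Flow API Disabled" line in ovs-vswitchd.log should disappear.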

I hope this helps you.

Best regards,
kay5

Hello,

Thank you for your prompt response and for providing us with the additional information. We appreciate your cooperation.

I understand that you are seeking further assistance, but at this stage, we would require additional logs and information to better understand and troubleshoot the issue you are facing. Regarding your main question about the compatibility of RoCE and Open vSwitch (OVS) with hardware offloads, I can confirm that they can work together in parallel as long as all the necessary conditions are met and the steps outlined in the documentation guide are followed accurately.

To proceed with resolving the issue, I recommend referring to the documentation guide provided by us, as it contains detailed instructions and troubleshooting steps specific to your setup. Additionally, it may be beneficial to consult with the vendor of your system for further guidance and support.

For any further questions or concerns, please feel free to reach out. We are here to assist you.

Best regards,

Yogev Petrov
Nvidia support