I cannot establish a connection between two servers with ConnectX-7 NICs over RoCEv2 when PFC is enabled.
When PFC is turned off (no TOS value is set), everything works fine.
Basic IP communication like ping is working all the time.
Below is a detailed summary of my troubleshooting:
RoCEv2 / RDMA behavior with Cisco Nexus 9300 Switch + Mellanox ConnectX-7
Summary of all troubleshooting steps and observations
1) Environment
Topology
- 2 servers with Mellanox ConnectX NICs (`mlx5`)
- 1 Cisco Nexus 9000 switch (hostname: `N9K-AI-BACKEND-1`)
- Point-to-point routed setup via the switch
node-01: mlx5_1 → enp129s0f1np1 → 10.13.150.2/30
|
| Cisco Nexus 9000
|
node-02: mlx5_1 → enp129s0f1np1 → 10.13.160.2/30
Switch-facing interfaces
node-01 <-> Ethernet1/29 <-> switch
node-02 <-> Ethernet1/31 <-> switch
Link speed / topology note
- Both server links are 200G.
- These two servers are the only devices attached to the switch for this traffic path.
- No other attached endpoints send traffic to these two servers during these tests.
2) General problem statement
The issue only appears when a non-zero TOS / traffic class is used for RoCE-related traffic.
Observed error strings:
ib_write_bw:
"Bad wc status 12"
"Failed status 12: wr_id 0 syndrom 0x81"
ibv_rc_pingpong:
"transport retry counter exceeded"
rping:
"cq completion failed status 12"
3) Verified baseline / general checks
The following checks were completed successfully during troubleshooting:
✓ ICMP ping works between both nodes
✓ Jumbo ping works (ping -s 8972)
✓ UDP port 5555 works end-to-end
✓ GID index 3 (IPv4 + RoCEv2) is correct on both nodes
✓ cma_roce_mode is set to RoCEv2 on both nodes
✓ Link layer is Ethernet on both nodes
✓ rdma link state is ACTIVE on the correct netdev
✓ IP routing resolves correctly via enp129s0f1np1 / the correct netdev
✓ ARP resolution works correctly
✓ No iptables rules are blocking the traffic
✓ No tunnel interfaces are present on either node
✓ /dev/infiniband/rdma_cm exists on both nodes
✓ Required RDMA kernel modules are loaded
✓ The switch has no relevant ACLs affecting this traffic
✓ Switch CoPP shows no drops
✓ ELAM on the Nexus shows:
- packets are seen arriving on Eth1/31
- outgoing interface resolves to Eth1/29
- no drops are shown in ELAM
✓ network-qos MTU was changed from 4200 to 9216
✓ NIC hardware counters confirmed that packets were being sent
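As a sanity check on the MTU-related items above, the frame sizes can be worked out by hand. The sketch below assumes untagged Ethernet, IPv4 without options, and an RDMA WRITE packet that carries a RETH; the function names are made up for illustration.

```python
# Header sizes in bytes (untagged Ethernet, IPv4 without options).
IPV4, ICMP, UDP = 20, 8, 8
ETH, FCS = 14, 4
BTH, RETH, ICRC = 12, 16, 4   # InfiniBand transport headers carried by RoCEv2

def ping_ip_size(payload: int) -> int:
    """IP packet size produced by `ping -s <payload>`."""
    return payload + ICMP + IPV4

def rocev2_write_frame(path_mtu: int) -> int:
    """Largest Ethernet frame for an RDMA WRITE packet that carries a RETH."""
    return ETH + IPV4 + UDP + BTH + RETH + path_mtu + ICRC + FCS

print(ping_ip_size(8972))         # 9000
print(rocev2_write_frame(4096))   # 4174
```

Under these assumptions, a RoCEv2 frame at the 4096-byte path MTU reported by `ib_write_bw` is smaller than both the earlier `4200` and the current `9216` network-qos MTU. A VLAN tag would add 4 bytes.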
4) RDMA / NIC state on both nodes
GID table example
DEV PORT INDEX GID IPv4 VER
mlx5_1 1 0 fe80:0000:0000:0000:bae9:24ff:fe3a:bdff v1
mlx5_1 1 1 fe80:0000:0000:0000:bae9:24ff:fe3a:bdff v2
mlx5_1 1 2 0000:0000:0000:0000:0000:ffff:0a0d:a002 10.13.160.2 v1
mlx5_1 1 3 0000:0000:0000:0000:0000:ffff:0a0d:a002 10.13.160.2 v2
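The GIDs at index 2 and 3 are IPv4-mapped IPv6 addresses, so the embedded IPv4 address can be read back from the hex form. A small sketch using Python's standard `ipaddress` module (the helper name `gid_to_ipv4` is hypothetical):

```python
import ipaddress

def gid_to_ipv4(gid: str):
    """Return the IPv4 address embedded in an IPv4-mapped GID, else None."""
    return ipaddress.IPv6Address(gid).ipv4_mapped

# Index 3 from the table above (RoCE v2, IPv4-mapped).
print(gid_to_ipv4("0000:0000:0000:0000:0000:ffff:0a0d:a002"))   # 10.13.160.2
# Index 1 is a link-local GID and carries no IPv4 address.
print(gid_to_ipv4("fe80:0000:0000:0000:bae9:24ff:fe3a:bdff"))   # None
```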
Link layer
cat /sys/class/infiniband/mlx5_1/ports/1/link_layer
Ethernet
RoCE mode
cma_roce_mode -d mlx5_1 -p 1
RoCE v2
RDMA link state
rdma link show
link mlx5_1/1 state ACTIVE physical_state LINK_UP netdev enp129s0f1np1
Route check
ip route get 10.13.150.2
10.13.150.2 via 10.13.160.1 dev enp129s0f1np1 src 10.13.160.2
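The route output is consistent with the two /30 networks being separate subnets, so traffic between the nodes is routed by the switch rather than bridged. A quick check with the standard `ipaddress` module, using only the addresses shown in this post:

```python
import ipaddress

node01 = ipaddress.ip_interface("10.13.150.2/30")
node02 = ipaddress.ip_interface("10.13.160.2/30")
gateway = ipaddress.ip_address("10.13.160.1")   # next hop from `ip route get`

print(node01.network, node02.network)   # 10.13.150.0/30 10.13.160.0/30
print(node01.ip in node02.network)      # False: node-01 is not on-link for node-02
print(gateway in node02.network)        # True: the next hop is on-link
```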
InfiniBand device nodes
ls /dev/infiniband/
rdma_cm umad0 umad1 uverbs0 uverbs1
5) Mellanox QoS / DCB state on both nodes
The following mlnx_qos output was captured on both nodes. The outputs were identical.
node-02
root@ai-compute-node-02:~# mlnx_qos -i enp129s0f1np1
DCBX mode: OS controlled
Priority trust state: dscp
dscp2prio mapping:
prio:0 dscp:07,06,05,04,03,02,01,00,
prio:1 dscp:15,14,13,12,11,10,09,08,
prio:2 dscp:23,22,21,20,19,18,17,16,
prio:3 dscp:31,30,29,28,27,26,25,24,
prio:4 dscp:39,38,37,36,35,34,33,32,
prio:5 dscp:47,46,45,44,43,42,41,40,
prio:6 dscp:55,54,53,52,51,50,49,48,
prio:7 dscp:63,62,61,60,59,58,57,56,
Receive buffer size (bytes): 19872,276768,0,0,0,0,0,0,max_buffer_size=2069280
Cable len: 7
PFC configuration:
priority 0 1 2 3 4 5 6 7
enabled 0 0 0 1 0 0 0 0
buffer 0 0 0 1 0 0 0 0
tc: 0 ratelimit: unlimited, tsa: vendor
priority: 1
tc: 1 ratelimit: unlimited, tsa: vendor
priority: 0
tc: 2 ratelimit: unlimited, tsa: vendor
priority: 2
tc: 3 ratelimit: unlimited, tsa: vendor
priority: 3
tc: 4 ratelimit: unlimited, tsa: vendor
priority: 4
tc: 5 ratelimit: unlimited, tsa: vendor
priority: 5
tc: 6 ratelimit: unlimited, tsa: vendor
priority: 6
tc: 7 ratelimit: unlimited, tsa: vendor
priority: 7
node-01
admin@ai-compute-node-01:~$ mlnx_qos -i enp129s0f1np1
DCBX mode: OS controlled
Priority trust state: dscp
dscp2prio mapping:
prio:0 dscp:07,06,05,04,03,02,01,00,
prio:1 dscp:15,14,13,12,11,10,09,08,
prio:2 dscp:23,22,21,20,19,18,17,16,
prio:3 dscp:31,30,29,28,27,26,25,24,
prio:4 dscp:39,38,37,36,35,34,33,32,
prio:5 dscp:47,46,45,44,43,42,41,40,
prio:6 dscp:55,54,53,52,51,50,49,48,
prio:7 dscp:63,62,61,60,59,58,57,56,
Receive buffer size (bytes): 19872,276768,0,0,0,0,0,0,max_buffer_size=2069280
Cable len: 7
PFC configuration:
priority 0 1 2 3 4 5 6 7
enabled 0 0 0 1 0 0 0 0
buffer 0 0 0 1 0 0 0 0
tc: 0 ratelimit: unlimited, tsa: vendor
priority: 1
tc: 1 ratelimit: unlimited, tsa: vendor
priority: 0
tc: 2 ratelimit: unlimited, tsa: vendor
priority: 2
tc: 3 ratelimit: unlimited, tsa: vendor
priority: 3
tc: 4 ratelimit: unlimited, tsa: vendor
priority: 4
tc: 5 ratelimit: unlimited, tsa: vendor
priority: 5
tc: 6 ratelimit: unlimited, tsa: vendor
priority: 6
tc: 7 ratelimit: unlimited, tsa: vendor
priority: 7
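To read the output above at a glance, the `dscp2prio` rows can be joined with the PFC `enabled` row to list which DSCP values land on a PFC-enabled priority. This is only a sketch: the sample string is a trimmed excerpt of the output above, and the parser (`pfc_dscps`, a made-up name) assumes the text layout shown here rather than any documented `mlnx_qos` format.

```python
import re

MLNX_QOS_SAMPLE = """\
prio:2 dscp:23,22,21,20,19,18,17,16,
prio:3 dscp:31,30,29,28,27,26,25,24,
PFC configuration:
priority 0 1 2 3 4 5 6 7
enabled 0 0 0 1 0 0 0 0
"""

def pfc_dscps(text: str) -> list[int]:
    """Return the DSCP values that map to a PFC-enabled priority."""
    dscp_by_prio = {}
    enabled = []
    for line in text.splitlines():
        m = re.match(r"\s*prio:(\d+)\s+dscp:([\d,]+)", line)
        if m:
            # The trailing comma in the output yields an empty item; skip it.
            dscp_by_prio[int(m.group(1))] = [int(d) for d in m.group(2).split(",") if d]
        elif line.split()[:1] == ["enabled"]:
            enabled = [p for p, flag in enumerate(line.split()[1:]) if flag == "1"]
    return sorted(d for p in enabled for d in dscp_by_prio.get(p, []))

print(pfc_dscps(MLNX_QOS_SAMPLE))   # [24, 25, 26, 27, 28, 29, 30, 31]
```

With the configuration shown, DSCP 26 is among the values covered by PFC on priority 3.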
ethtool observations
ethtool -S enp129s0f1np1 | egrep -i 'pfc|pause|stopped|drop|disc'
- Observed result on both nodes during these checks: all queried counters remained `0`.
6) Switch QoS / PFC configuration used
Classification / QoS / PFC configuration
class-map type qos match-any ROCEv2
match dscp 26
policy-map type qos QOS_CLASSIFICATION
class ROCEv2
set qos-group 3
policy-map type network-qos qos_network
class type network-qos c-8q-nq3
mtu 9216
pause pfc-cos 3
class type network-qos c-8q-nq-default
mtu 9216
policy-map type queuing QOS_EGRESS_PORT
class type queuing c-out-8q-q3
bandwidth remaining percent 50
random-detect minimum-threshold 950 kbytes
maximum-threshold 3000 kbytes
drop-probability 7 weight 0 ecn
interface Ethernet1/29
priority-flow-control mode on
mtu 9216
service-policy type qos input QOS_CLASSIFICATION
interface Ethernet1/31
priority-flow-control mode on
mtu 9216
service-policy type qos input QOS_CLASSIFICATION
Notes from troubleshooting
- Earlier in troubleshooting, the `network-qos` MTU for the RoCE class had been `4200`; it was later changed to `9216`.
- ECN was later removed in a separate test. The observed `ib_write_bw` behavior remained unchanged in that test.
7) TOS methods used
Method A: sysfs traffic_class
echo 104 > /sys/class/infiniband/mlx5_1/tc/1/traffic_class
Method B: cma_roce_tos
cma_roce_tos -d mlx5_1 -t 104
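The value 104 is the full 8-bit TOS byte: the upper six bits are the DSCP and the lower two are the ECN field. A minimal sketch of that arithmetic follows; the helper name `split_tos` is made up, and the `// 8` step assumes the dscp2prio table shown in section 5 (eight consecutive DSCP values per priority).

```python
def split_tos(tos: int) -> tuple[int, int]:
    """Split an 8-bit TOS / traffic-class byte into (DSCP, ECN)."""
    if not 0 <= tos <= 255:
        raise ValueError("TOS must fit in one byte")
    return tos >> 2, tos & 0b11

dscp, ecn = split_tos(104)
prio = dscp // 8   # assumption: dscp2prio groups DSCPs in blocks of eight
print(dscp, ecn, prio)   # 26 0 3
```

So TOS 104 corresponds to DSCP 26 with ECN bits 00, which is the DSCP matched by the `ROCEv2` class-map and, on the hosts, priority 3.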
8) Test matrix and results
All tests are listed below, including client and server output and switch packet counters.
Test 1: ib_write_bw without -R, TOS via echo = 0
Matrix
| TOS via Echo | TOS via cma_roce_tos |
|---|---|
| 0 | 0 |
| 0 | 104 |
Both combinations produced the same result.
Test 1.1: ib_write_bw -d mlx5_1 -F
Result
- No error
- Test completed successfully
- Switch did not detect packets with DSCP 26
Server output
root@ai-compute-node-01:~# ib_write_bw -d mlx5_1 -F
************************************
_Waiting for client to connect..._
************************************
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_1
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON Lock-free : OFF
ibv_wr* API : ON Using DDP : OFF
CQ Moderation : 1
CQE Poll Batch : 16
Mtu : 4096[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x0570 PSN 0x879b8c RKey 0x060400 VAddr 0x0078bdf184a000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:150:02
remote address: LID 0000 QPN 0x034f PSN 0x5783f6 RKey 0x23beda VAddr 0x0077d83de80000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:160:02
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MiB/sec] BW average[MiB/sec] MsgRate[Mpps]
65536 5000 186.52 186.35 0.355431
---------------------------------------------------------------------------------------
Client output
root@ai-compute-node-02:~# ib_write_bw -d mlx5_1 -F --report_gbits 10.13.150.2
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_1
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON Lock-free : OFF
ibv_wr* API : ON Using DDP : OFF
TX depth : 128
CQ Moderation : 1
CQE Poll Batch : 16
Mtu : 4096[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x034f PSN 0x5783f6 RKey 0x23beda VAddr 0x0077d83de80000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:160:02
remote address: LID 0000 QPN 0x0570 PSN 0x879b8c RKey 0x060400 VAddr 0x0078bdf184a000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:150:02
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
65536 5000 186.52 186.35 0.355431
Switch observation
No packets with DSCP 26 detected.
Test 1.2: ib_write_bw -d mlx5_1 -F -a
Result
- No error
- Test completed successfully
- Switch did not detect packets with DSCP 26
Server output
root@ai-compute-node-01:~# ib_write_bw -d mlx5_1 -F -a
************************************
_Waiting for client to connect..._
************************************
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_1
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON Lock-free : OFF
ibv_wr* API : ON Using DDP : OFF
CQ Moderation : 100
CQE Poll Batch : 16
Mtu : 4096[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x0571 PSN 0x67399f RKey 0x060400 VAddr 0x0073eb32dff000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:150:02
remote address: LID 0000 QPN 0x0350 PSN 0xe3747 RKey 0x23be00 VAddr 0x007bed877ff000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:160:02
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MiB/sec] BW average[MiB/sec] MsgRate[Mpps]
8388608 5000 186.40 186.35 0.002777
---------------------------------------------------------------------------------------
Client output
root@ai-compute-node-02:~# ib_write_bw -d mlx5_1 -F --report_gbits -a 10.13.150.2
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_1
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON Lock-free : OFF
ibv_wr* API : ON Using DDP : OFF
TX depth : 128
CQ Moderation : 100
CQE Poll Batch : 16
Mtu : 4096[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x0350 PSN 0xe3747 RKey 0x23be00 VAddr 0x007bed877ff000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:160:02
remote address: LID 0000 QPN 0x0571 PSN 0x67399f RKey 0x060400 VAddr 0x0073eb32dff000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:150:02
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
2 5000 0.049467 0.048618 3.038600
4 5000 0.12 0.12 3.602183
8 5000 0.23 0.23 3.621518
16 5000 0.46 0.46 3.601976
32 5000 0.93 0.93 3.625831
64 5000 1.86 1.86 3.632780
128 5000 3.71 3.70 3.615469
256 5000 7.39 7.38 3.604389
512 5000 14.76 14.74 3.598350
1024 5000 29.37 29.34 3.582038
2048 5000 58.20 58.13 3.547757
4096 5000 116.09 115.44 3.522918
8192 5000 193.09 192.96 2.944301
16384 5000 188.36 188.17 1.435660
32768 5000 186.57 186.45 0.711250
65536 5000 186.33 186.25 0.355243
131072 5000 185.87 185.84 0.177230
262144 5000 186.06 185.89 0.088640
524288 5000 185.86 185.82 0.044302
1048576 5000 186.36 186.35 0.022215
2097152 5000 186.39 186.35 0.011107
4194304 5000 186.36 186.35 0.005554
8388608 5000 186.40 186.35 0.002777
---------------------------------------------------------------------------------------
Switch observation
No packets with DSCP 26 detected.
Test 2: ib_write_bw without -a, TOS via echo = 104
Matrix
| TOS via Echo | TOS via cma_roce_tos |
|---|---|
| 104 | 0 |
| 104 | 104 |
Both combinations produced the same result.
Result
- Error
- Switch did not detect packets with DSCP 26
- No switch drops observed
Server output
root@ai-compute-node-01:~# ib_write_bw -d mlx5_1 -F
************************************
_Waiting for client to connect..._
************************************
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_1
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON Lock-free : OFF
ibv_wr* API : ON Using DDP : OFF
CQ Moderation : 1
CQE Poll Batch : 16
Mtu : 4096[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x0565 PSN 0x88ffe8 RKey 0x0604d9 VAddr 0x0079c6ff8c2000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:150:02
remote address: LID 0000 QPN 0x0344 PSN 0xe1bb30 RKey 0x23bed9 VAddr 0x007a486535e000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:160:02
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MiB/sec] BW average[MiB/sec] MsgRate[Mpps]
ethernet_read_keys: Couldn't read remote address
Unable to read to socket/rdma_cm
Failed to exchange data between server and clients
Client output
root@ai-compute-node-02:~# ib_write_bw -d mlx5_1 -F 10.13.150.2
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_1
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON Lock-free : OFF
ibv_wr* API : ON Using DDP : OFF
TX depth : 128
CQ Moderation : 1
CQE Poll Batch : 16
Mtu : 4096[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x0344 PSN 0xe1bb30 RKey 0x23bed9 VAddr 0x007a486535e000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:160:02
remote address: LID 0000 QPN 0x0565 PSN 0x88ffe8 RKey 0x0604d9 VAddr 0x0079c6ff8c2000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:150:02
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MiB/sec] BW average[MiB/sec] MsgRate[Mpps]
Completion with error at client
Failed status 12: wr_id 0 syndrom 0x81
scnt=128, ccnt=0
Failed to complete run_iter_bw function successfully
Switch observation
No packets with DSCP 26 detected.
No switch drops observed.
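For readers decoding the error: status 12 in the `ibv_wc_status` enum of libibverbs is `IBV_WC_RETRY_EXC_ERR`, the same condition that `ibv_rc_pingpong` reports as "transport retry counter exceeded". The lookup below is only an illustrative subset of that enum (the helper name `wc_status_name` is hypothetical), and it does not attempt to decode the vendor syndrome 0x81.

```python
# Subset of enum ibv_wc_status from <infiniband/verbs.h>.
# Only a few values relevant to this thread are listed.
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    5: "IBV_WC_WR_FLUSH_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
}

def wc_status_name(status: int) -> str:
    """Map a numeric work-completion status to its enum name."""
    return IBV_WC_STATUS.get(status, f"unknown ({status})")

print(wc_status_name(12))   # IBV_WC_RETRY_EXC_ERR
```

In general terms, this status means the requester retried and still did not get a response from the remote side. By itself it does not say where the request or the response was lost.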
Test 3: ib_write_bw with -a, TOS via echo = 104
Matrix
| TOS via Echo | TOS via cma_roce_tos |
|---|---|
| 104 | 0 |
| 104 | 104 |
Both combinations produced the same result.
Result
- Initial packets are sent
- Test then fails with the same error pattern as Test 2
Server output
root@ai-compute-node-01:~# ib_write_bw -d mlx5_1 -F -a
************************************
_Waiting for client to connect..._
************************************
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_1
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON Lock-free : OFF
ibv_wr* API : ON Using DDP : OFF
CQ Moderation : 100
CQE Poll Batch : 16
Mtu : 4096[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x0566 PSN 0x7ff1d6 RKey 0x060400 VAddr 0x0076daf61ff000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:150:02
remote address: LID 0000 QPN 0x0345 PSN 0xd6adbb RKey 0x23be00 VAddr 0x007808bcb2e000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:160:02
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MiB/sec] BW average[MiB/sec] MsgRate[Mpps]
ethernet_read_keys: Couldn't read remote address
Unable to read to socket/rdma_cm
Failed to exchange data between server and clients
Client output
root@ai-compute-node-02:~# ib_write_bw -d mlx5_1 -F -a 10.13.150.2
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_1
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON Lock-free : OFF
ibv_wr* API : ON Using DDP : OFF
TX depth : 128
CQ Moderation : 100
CQE Poll Batch : 16
Mtu : 4096[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x0345 PSN 0xd6adbb RKey 0x23be00 VAddr 0x007808bcb2e000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:160:02
remote address: LID 0000 QPN 0x0566 PSN 0x7ff1d6 RKey 0x060400 VAddr 0x0076daf61ff000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:150:02
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MiB/sec] BW average[MiB/sec] MsgRate[Mpps]
2 5000 5.66 5.57 2.920923
4 5000 13.55 13.47 3.529919
8 5000 27.28 27.04 3.544757
16 5000 54.13 53.96 3.536479
Completion with error at client
Failed status 12: wr_id 0 syndrom 0x81
scnt=128, ccnt=0
Failed to complete run_iter_bw function successfully
root@ai-compute-node-02:~#
Switch observation
Eth1/29
Ingress: 15834 packets with DSCP 26
No packet drops
No queued packets
No PFC packets
Egress: 20000 packets
Eth1/31
Ingress: 20000 packets with DSCP 26
No packet drops
No queued packets
No PFC packets
Egress: 15834 packets
Additional repeated observation:
Eth1/31 ingress was always 20000 packets
Eth1/29 ingress varied between 15000 and 16000 packets, but never 20000
Ingress and egress counts on the two ports always matched complementarily
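The counter values can be related to the client output with a little arithmetic. This sketch assumes that each message of the four completed sizes (2 to 16 bytes) is carried in a single packet; the variable names are made up for illustration.

```python
ITERATIONS = 5000
completed_sizes = [2, 4, 8, 16]   # sizes that finished before the error

eth1_31_ingress = 20000           # DSCP 26 packets arriving from node-02 (client)
eth1_29_ingress = 15834           # DSCP 26 packets arriving from node-01 (server)

expected_requests = ITERATIONS * len(completed_sizes)
return_deficit = eth1_31_ingress - eth1_29_ingress
deficit_pct = round(100 * return_deficit / eth1_31_ingress, 1)

print(expected_requests, return_deficit, deficit_pct)   # 20000 4166 20.8
```

The 20000 packets seen from the client match 4 sizes x 5000 iterations. The lower count in the reverse direction is not proof of loss on its own, because an RC responder is allowed to coalesce acknowledgements.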
Test 4: ib_write_bw using rdma_cm (-R), TOS via echo = 104, cma_roce_tos = 104
Matrix
| TOS via Echo | TOS via cma_roce_tos |
|---|---|
| 104 | 104 |
Result
- With and without `-a`: no output
- Commands had to be interrupted manually
- Switch did not detect packets with DSCP 26
- No switch drops observed
Server output
root@ai-compute-node-01:~# ib_write_bw -d mlx5_1 -F -R
************************************
_Waiting for client to connect..._
************************************
^C
Client output
root@ai-compute-node-02:~# ib_write_bw -d mlx5_1 -F -R 10.13.150.2
^C
Switch observation
No packets with DSCP 26 detected.
No switch drops observed.
Test 5: ib_write_bw using rdma_cm (-R), TOS via echo = 104, cma_roce_tos = 0
Matrix
| TOS via Echo | TOS via cma_roce_tos |
|---|---|
| 104 | 0 |
Result
- Client error
- Switch observed 8 ingress and 8 egress packets on both ports
Server output
root@ai-compute-node-01:~# ib_write_bw -d mlx5_1 -F -R
************************************
_Waiting for client to connect..._
************************************
^C
Client output
root@ai-compute-node-02:~# ib_write_bw -d mlx5_1 -F -R 10.13.150.2
Bad wc status 12
Unable to write to socket/rdma_cm
Failed to sync between client and server before creating RDMA CM connection.
ERRNO: No such file or directory.
Failed to create RDMA CM connection with resources.
Switch observation
Ingress and Egress: 8 packets on Eth1/29 and Eth1/31
Test 6: ib_write_bw using rdma_cm (-R), TOS = 0
Matrix
| TOS via Echo | TOS via cma_roce_tos |
|---|---|
| 0 | 0 |
Result
- `ib_write_bw -R` works without error
- `ib_write_bw -R -a` also works without error
- Switch did not detect packets with DSCP 26
- No switch drops observed
Server output (-R)
root@ai-compute-node-01:~# ib_write_bw -d mlx5_1 -F -R
************************************
_Waiting for client to connect..._
************************************
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_1
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON Lock-free : OFF
ibv_wr* API : ON Using DDP : OFF
CQ Moderation : 1
CQE Poll Batch : 16
Mtu : 4096[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : ON
Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
Waiting for client rdma_cm QP to connect
Please run the same command with the IB/RoCE interface IP
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x058a PSN 0x30227b
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:150:02
remote address: LID 0000 QPN 0x0368 PSN 0xf4e3b2
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:160:02
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MiB/sec] BW average[MiB/sec] MsgRate[Mpps]
65536 5000 21588.41 14648.41 0.234375
---------------------------------------------------------------------------------------
Server output (-R -a)
root@ai-compute-node-01:~# ib_write_bw -d mlx5_1 -F -R -a
************************************
_Waiting for client to connect..._
************************************
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_1
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON Lock-free : OFF
ibv_wr* API : ON Using DDP : OFF
CQ Moderation : 100
CQE Poll Batch : 16
Mtu : 4096[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : ON
Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
Waiting for client rdma_cm QP to connect
Please run the same command with the IB/RoCE interface IP
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x058c PSN 0x5f6b64
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:150:02
remote address: LID 0000 QPN 0x036a PSN 0xcac767
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:160:02
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MiB/sec] BW average[MiB/sec] MsgRate[Mpps]
8388608 5000 22320.78 22317.14 0.002790
---------------------------------------------------------------------------------------
root@ai-compute-node-01:~#
Client output (-R)
root@ai-compute-node-02:~# ib_write_bw -d mlx5_1 -F -R 10.13.150.2
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_1
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON Lock-free : OFF
ibv_wr* API : ON Using DDP : OFF
TX depth : 128
CQ Moderation : 1
CQE Poll Batch : 16
Mtu : 4096[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : ON
Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x0368 PSN 0xf4e3b2
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:160:02
remote address: LID 0000 QPN 0x058a PSN 0x30227b
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:150:02
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MiB/sec] BW average[MiB/sec] MsgRate[Mpps]
65536 5000 21588.41 14648.41 0.234375
---------------------------------------------------------------------------------------
Client output (-R -a)
root@ai-compute-node-02:~# ib_write_bw -d mlx5_1 -F -R -a 10.13.150.2
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_1
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON Lock-free : OFF
ibv_wr* API : ON Using DDP : OFF
TX depth : 128
CQ Moderation : 100
CQE Poll Batch : 16
Mtu : 4096[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : ON
Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x036a PSN 0xcac767
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:160:02
remote address: LID 0000 QPN 0x058c PSN 0x5f6b64
GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:150:02
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MiB/sec] BW average[MiB/sec] MsgRate[Mpps]
2 5000 5.01 4.85 2.542956
4 5000 13.51 13.51 3.540758
8 5000 26.96 26.94 3.531064
16 5000 54.49 54.05 3.542242
32 5000 108.26 108.12 3.542739
64 5000 216.51 213.97 3.505765
128 5000 433.59 431.86 3.537838
256 5000 863.78 863.41 3.536529
512 5000 1716.28 1714.21 3.510692
1024 5000 3423.66 3419.20 3.501260
2048 5000 4260.16 4224.91 2.163152
4096 5000 4944.21 4797.44 1.228145
8192 5000 6064.57 5780.67 0.739926
16384 5000 17082.16 8420.76 0.538928
32768 5000 21077.72 19990.12 0.639684
65536 5000 22276.43 22214.67 0.355435
131072 5000 22270.59 22249.24 0.177994
262144 5000 22280.54 22223.47 0.088894
524288 5000 22276.86 22221.16 0.044442
1048576 5000 22330.90 22318.07 0.022318
2097152 5000 22340.31 22310.08 0.011155
4194304 5000 22318.65 22315.21 0.005579
8388608 5000 22320.78 22317.14 0.002790
---------------------------------------------------------------------------------------
Switch observation
No packets with DSCP 26 detected.
No switch drops observed.
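The `-R` runs above report MiB/sec while the Test 1 client runs used `--report_gbits`, so the numbers are not directly comparable. A small conversion sketch, assuming the `MiB/sec` label means 2^20 bytes per second (the function name is made up):

```python
def mib_per_s_to_gbit_per_s(mib_per_s: float) -> float:
    """Convert MiB/s (2**20 bytes per second) to Gb/s (10**9 bits per second)."""
    return mib_per_s * 2**20 * 8 / 1e9

# Average from the `-R -a` run at 8388608 bytes.
print(round(mib_per_s_to_gbit_per_s(22317.14), 2))
```

Under that assumption, the `-R` result at TOS 0 comes out at roughly 187 Gb/sec, in the same range as the 186.35 Gb/sec average from Test 1.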
Test 7: rping
Test 7.1: rping, TOS via echo = 104, cma_roce_tos = 104
Matrix
| TOS via Echo | TOS via cma_roce_tos |
|---|---|
| 104 | 104 |
Result
- Commands had to be interrupted manually
- Switch did not detect packets with DSCP 26
Server output
root@ai-compute-node-01:~# rping -s -a 10.13.150.2 -v -d
created cm_id 0x63f615a24b10
rdma_bind_addr successful
rdma_listen
^C
Client output
root@ai-compute-node-02:~# rping -c -a 10.13.150.2 -C 10 -v -d
created cm_id 0x6078a8977b10
cma_event type RDMA_CM_EVENT_ADDR_RESOLVED cma_id 0x6078a8977b10 (parent)
cma_event type RDMA_CM_EVENT_ROUTE_RESOLVED cma_id 0x6078a8977b10 (parent)
rdma_resolve_addr - rdma_resolve_route successful
created pd 0x6078a89782b0
created channel 0x6078a89745c0
created cq 0x6078a8978310
created qp 0x6078a897b538
rping_setup_buffers called on cb 0x6078a89747c0
allocated & registered buffers...
cq_thread started.
^C
Switch observation
No packets on ports with DSCP 26.
Test 7.2: rping, TOS via echo = 104, cma_roce_tos = 0
Matrix
| TOS via Echo | TOS via cma_roce_tos |
|---|---|
| 104 | 0 |
Result
- Connection reaches ESTABLISHED
- Then fails with status 12
Server output
root@ai-compute-node-01:~# rping -s -a 10.13.150.2 -v -d
created cm_id 0x62c4b87d9b10
rdma_bind_addr successful
rdma_listen
cma_event type RDMA_CM_EVENT_CONNECT_REQUEST cma_id 0x724ccc000ce0 (child)
child cma 0x724ccc000ce0
created pd 0x62c4b87daa50
created channel 0x62c4b87daab0
created cq 0x62c4b87daad0
created qp 0x62c4b87dadd8
rping_setup_buffers called on cb 0x62c4b87d67c0
allocated & registered buffers...
accepting client connection request
cq_thread started.
cma_event type RDMA_CM_EVENT_ESTABLISHED cma_id 0x724ccc000ce0 (child)
ESTABLISHED
recv completion
Received rkey 23b7b5 addr 58cae5b13340 len 64 from peer
server received sink adv
server posted rdma read req
cq completion failed status 12
wait for RDMA_READ_COMPLETE state 11
rping server failed: -1
rping_free_buffers called on cb 0x62c4b87d67c0
destroy cm_id 0x62c4b87d9b10
Client output
root@ai-compute-node-02:~# rping -c -a 10.13.150.2 -C 10 -v -d
created cm_id 0x58cae5b15b10
cma_event type RDMA_CM_EVENT_ADDR_RESOLVED cma_id 0x58cae5b15b10 (parent)
cma_event type RDMA_CM_EVENT_ROUTE_RESOLVED cma_id 0x58cae5b15b10 (parent)
rdma_resolve_addr - rdma_resolve_route successful
created pd 0x58cae5b162b0
created channel 0x58cae5b125c0
created cq 0x58cae5b16310
created qp 0x58cae5b19538
rping_setup_buffers called on cb 0x58cae5b127c0
allocated & registered buffers...
cq_thread started.
cma_event type RDMA_CM_EVENT_ESTABLISHED cma_id 0x58cae5b15b10 (parent)
ESTABLISHED
rdma_connect successful
RDMA addr 58cae5b13340 rkey 23b7b5 len 64
send completion
cma_event type RDMA_CM_EVENT_DISCONNECTED cma_id 0x58cae5b15b10 (parent)
client DISCONNECT EVENT...
wait for RDMA_WRITE_ADV state 10
rping_free_buffers called on cb 0x58cae5b127c0
destroy cm_id 0x58cae5b15b10
Switch observation
Eth1/29
Ingress: 14 packets with DSCP 26
No packet drops
No queued packets
No PFC packets
Egress: 1 packet
Eth1/31
Ingress: 1 packet with DSCP 26
No packet drops
No queued packets
No PFC packets
Egress: 14 packets
Test 7.3: rping, TOS = 0
Matrix
| TOS via Echo | TOS via cma_roce_tos |
|---|---|
| 0 | 0 |
Result
- `rping` successful
- Switch did not detect packets with DSCP 26
Server output
root@ai-compute-node-01:~# rping -s -a 10.13.150.2 -v -d
created cm_id 0x65485ce1eb10
rdma_bind_addr successful
rdma_listen
cma_event type RDMA_CM_EVENT_CONNECT_REQUEST cma_id 0x786be4000ce0 (child)
child cma 0x786be4000ce0
created pd 0x65485ce1fa50
created channel 0x65485ce1fab0
created cq 0x65485ce1fad0
created qp 0x65485ce1fdd8
rping_setup_buffers called on cb 0x65485ce1b7c0
allocated & registered buffers...
accepting client connection request
cq_thread started.
cma_event type RDMA_CM_EVENT_ESTABLISHED cma_id 0x786be4000ce0 (child)
recv completion
Received rkey 23bdbb addr 5990ec974340 len 64 from peer
ESTABLISHED
server received sink adv
server posted rdma read req
rdma read completion
server received read complete
server ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
server posted go ahead
send completion
[….]
cma_event type RDMA_CM_EVENT_DISCONNECTED cma_id 0x786be4000ce0 (child)
server DISCONNECT EVENT...
wait for RDMA_READ_ADV state 10
rping_free_buffers called on cb 0x65485ce1b7c0
destroy cm_id 0x65485ce1eb10
Client output
root@ai-compute-node-02:~# rping -c -a 10.13.150.2 -C 10 -v -d
created cm_id 0x5990ec976b10
cma_event type RDMA_CM_EVENT_ADDR_RESOLVED cma_id 0x5990ec976b10 (parent)
cma_event type RDMA_CM_EVENT_ROUTE_RESOLVED cma_id 0x5990ec976b10 (parent)
rdma_resolve_addr - rdma_resolve_route successful
created pd 0x5990ec9772b0
created channel 0x5990ec9735c0
created cq 0x5990ec977310
created qp 0x5990ec97a538
rping_setup_buffers called on cb 0x5990ec9737c0
allocated & registered buffers...
cq_thread started.
cma_event type RDMA_CM_EVENT_ESTABLISHED cma_id 0x5990ec976b10 (parent)
ESTABLISHED
rdma_connect successful
RDMA addr 5990ec974340 rkey 23bdbb len 64
send completion
recv completion
RDMA addr 5990ec974520 rkey 23bbb9 len 64
send completion
recv completion
ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
[…]
rping_free_buffers called on cb 0x5990ec9737c0
cma_event type RDMA_CM_EVENT_DISCONNECTED cma_id 0x5990ec976b10 (parent)
client DISCONNECT EVENT...
destroy cm_id 0x5990ec976b10
Switch observation
No packets with DSCP 26 detected.
9) Switch counter observations that remained consistent across tests
Observed in switch outputs
- Packets that were counted on ingress policy maps as DSCP 26 were also observed on the opposite side egress counters.
- In the focused 8-packet snapshots:
  - `Ethernet1/29` input policy: 8 DSCP-26 packets
  - `Ethernet1/31` input policy: 8 DSCP-26 packets
  - QoS group 3 egress on both ports: 8 packets
- In the `ib_write_bw -F -a` test with `echo 104`:
  - `Eth1/31 ingress = 20000`
  - `Eth1/29 egress = 20000`
  - `Eth1/29 ingress = 15834`
  - `Eth1/31 egress = 15834`
- `Eth1/31` ingress remained at `20000` in repeated runs of that test.
- `Eth1/29` ingress varied between `15000` and `16000` and did not reach `20000`.
- Ingress and egress counts on the two switch ports matched complementarily in those observations.
Observed queue / congestion counters
In the relevant switch outputs provided during troubleshooting:
- Queue drops: `0`
- WRED/AFD drops: `0`
- ECN packets: `0`
- Queue depth: `0`
- PFC packets / pause counters: `0`
11) Summary of observed behavior only
When TOS / traffic class is 0
Observed successful cases:
- `ib_write_bw -F`
- `ib_write_bw -F -a`
- `ib_write_bw -F -R`
- `ib_write_bw -F -R -a`
- `rping`
When echo 104 > /sys/class/infiniband/mlx5_1/tc/1/traffic_class is used
Observed outcomes:
- `ib_write_bw -F` fails
- `ib_write_bw -F -a` starts, then fails after initial sizes
- `ib_write_bw -F -R` with `cma_roce_tos = 0` fails
- `rping` with `cma_roce_tos = 0` reaches ESTABLISHED and then fails with status 12
When cma_roce_tos -d mlx5_1 -t 104 is used
Observed outcomes in the listed tests:
- In the tests above, the switch did not show DSCP 26 matches for the cases where only `cma_roce_tos` was relied on.
- `ib_write_bw -F -R` with `echo 104` and `cma_roce_tos 104` produced no output and had to be interrupted.
- `rping` with `echo 104` and `cma_roce_tos 104` also had to be interrupted.
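To make the pattern across tests easier to see, the results from section 8 can be written down as data and grouped by each TOS setting. The table below only restates the outcomes reported above; the column layout and the helper `outcomes_by` are made up for illustration.

```python
# (test, sysfs traffic_class, cma_roce_tos, uses rdma_cm, outcome)
RESULTS = [
    ("1",   0,   0,   False, "ok"),
    ("1",   0,   104, False, "ok"),
    ("2",   104, 0,   False, "status 12"),
    ("2",   104, 104, False, "status 12"),
    ("3",   104, 0,   False, "status 12"),
    ("3",   104, 104, False, "status 12"),
    ("4",   104, 104, True,  "hang"),
    ("5",   104, 0,   True,  "status 12"),
    ("6",   0,   0,   True,  "ok"),
    ("7.1", 104, 104, True,  "hang"),
    ("7.2", 104, 0,   True,  "status 12"),
    ("7.3", 0,   0,   True,  "ok"),
]

def outcomes_by(index: int) -> dict:
    """Group the set of observed outcomes by one column of RESULTS."""
    grouped = {}
    for row in RESULTS:
        grouped.setdefault(row[index], set()).add(row[4])
    return grouped

print(outcomes_by(1))   # grouped by sysfs traffic_class
print(outcomes_by(2))   # grouped by cma_roce_tos
```

In this data set every run with sysfs `traffic_class` = 0 succeeded and every run with 104 failed or hung, while `cma_roce_tos` alone does not separate the outcomes. The hangs appear only in the rdma_cm runs where both values were 104.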
12) What I am looking for
I am looking for interpretation of the facts above, especially from anyone familiar with:
- Mellanox ConnectX RoCEv2 behavior when non-zero traffic class is set
- Cisco Nexus 9000 QoS / DSCP classification / queue counter interpretation for RoCEv2
- Differences between:
  - `echo <value> > /sys/class/infiniband/mlx5_1/tc/1/traffic_class`
  - `cma_roce_tos -d mlx5_1 -t <value>`
- Why the test behavior changes depending on TOS setting method and command mode (`-R` vs no `-R`)