PFC not working with RDMA over RoCEv2 and ConnectX-7

I cannot establish an RDMA connection between two servers with ConnectX-7 NICs over RoCEv2 when PFC is enabled.
When PFC is turned off (no TOS value is set), everything works fine.
Basic IP communication such as ping works the whole time.
Below is a detailed summary of my troubleshooting:


RoCEv2 / RDMA behavior with Cisco Nexus 9300 Switch + Mellanox ConnectX-7

Summary of all troubleshooting steps and observations

1) Environment

Topology

  • 2 servers with Mellanox ConnectX-7 NICs (mlx5)
  • 1 Cisco Nexus 9000 switch (hostname: N9K-AI-BACKEND-1)
  • Point-to-point routed setup via the switch
node-01: mlx5_1 → enp129s0f1np1 → 10.13.150.2/30
                         |
                         |  Cisco Nexus 9000
                         |
node-02: mlx5_1 → enp129s0f1np1 → 10.13.160.2/30

Switch-facing interfaces

node-01 <-> Ethernet1/29 <-> switch
node-02 <-> Ethernet1/31 <-> switch

Link speed / topology note

  • Both server links are 200G.
  • These two servers are the only devices attached to the switch for this traffic path.
  • No other attached endpoints send traffic to these two servers during these tests.

2) General problem statement

The issue only appears when a non-zero TOS / traffic class is used for RoCE-related traffic.

Observed error strings:

ib_write_bw:
  "Bad wc status 12"
  "Failed status 12: wr_id 0 syndrom 0x81"

ibv_rc_pingpong:
  "transport retry counter exceeded"

rping:
  "cq completion failed status 12"

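All three tools appear to report the same completion status. As a rough aid for reading them, the sketch below maps the numeric status to its name in the libibverbs `enum ibv_wc_status`. Only a subset of the enum is listed, and the helper name `describe_wc_status` is made up for this example. Status 12 is `IBV_WC_RETRY_EXC_ERR`, which matches the "transport retry counter exceeded" text printed by ibv_rc_pingpong: the requester retransmitted and never received an acknowledgement.

```python
# Subset of the libibverbs `enum ibv_wc_status` values.
# Only the entries relevant to this report are listed here.
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    5: "IBV_WC_WR_FLUSH_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",      # "transport retry counter exceeded"
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
}


def describe_wc_status(code: int) -> str:
    """Return the enum name for a work-completion status code."""
    return IBV_WC_STATUS.get(code, f"unknown status {code}")


print(describe_wc_status(12))  # IBV_WC_RETRY_EXC_ERR
```

Read this way, the errors from ib_write_bw, ibv_rc_pingpong and rping all describe the same symptom: request packets leave the sender, but no response comes back before the retry limit is reached.
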
3) Verified baseline / general checks

The following checks were completed successfully during troubleshooting:

✓ ICMP ping works between both nodes
✓ Jumbo ping works (ping -s 8972)
✓ UDP port 5555 works end-to-end
✓ GID index 3 (IPv4 + RoCEv2) is correct on both nodes
✓ cma_roce_mode is set to RoCEv2 on both nodes
✓ Link layer is Ethernet on both nodes
✓ rdma link state is ACTIVE on the correct netdev
✓ IP routing resolves correctly via enp129s0f1np1 / the correct netdev
✓ ARP resolution works correctly
✓ No iptables rules are blocking the traffic
✓ No tunnel interfaces are present on either node
✓ /dev/infiniband/rdma_cm exists on both nodes
✓ Required RDMA kernel modules are loaded
✓ The switch has no relevant ACLs affecting this traffic
✓ Switch CoPP shows no drops
✓ ELAM on the Nexus shows:
    - packets are seen arriving on Eth1/31
    - outgoing interface resolves to Eth1/29
    - no drops are shown in ELAM
✓ network-qos MTU was changed from 4200 to 9216
✓ NIC hardware counters confirmed that packets were being sent
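
To put the MTU-related checks in numbers, here is a small worked calculation. It assumes untagged IPv4 frames (no VLAN tag) and standard header sizes; how the switch accounts these bytes against its network-qos MTU is not confirmed here. The helper names are made up for this example.

```python
# Standard header sizes in bytes (untagged IPv4 assumed).
ETH_HEADER = 14
ETH_FCS = 4
IPV4_HEADER = 20
ICMP_HEADER = 8
UDP_HEADER = 8
BTH = 12    # InfiniBand Base Transport Header
RETH = 16   # RDMA Extended Transport Header (first/only packet of a write)
ICRC = 4    # Invariant CRC appended to every RoCE packet


def icmp_ip_packet_size(icmp_payload: int) -> int:
    """IP packet size produced by `ping -s <icmp_payload>`."""
    return icmp_payload + ICMP_HEADER + IPV4_HEADER


def rocev2_write_frame_size(rdma_payload: int) -> int:
    """On-wire Ethernet frame size of a RoCEv2 RDMA write packet with RETH."""
    return (ETH_HEADER + IPV4_HEADER + UDP_HEADER + BTH + RETH
            + rdma_payload + ICRC + ETH_FCS)


print(icmp_ip_packet_size(8972))      # 9000
print(rocev2_write_frame_size(4096))  # 4174
```

Under these assumptions, the jumbo ping exercises a 9000-byte IP packet, and a full RoCEv2 packet at the 4096-byte path MTU used by perftest is about 4174 bytes on the wire, which fits below both the earlier 4200 and the current 9216 network-qos MTU.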

4) RDMA / NIC state on both nodes

GID table example

DEV     PORT  INDEX  GID                                     IPv4          VER
mlx5_1  1     0      fe80:0000:0000:0000:bae9:24ff:fe3a:bdff               v1
mlx5_1  1     1      fe80:0000:0000:0000:bae9:24ff:fe3a:bdff               v2
mlx5_1  1     2      0000:0000:0000:0000:0000:ffff:0a0d:a002  10.13.160.2  v1
mlx5_1  1     3      0000:0000:0000:0000:0000:ffff:0a0d:a002  10.13.160.2  v2
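
As a cross-check of the GID table, the IPv4-based GIDs are IPv4-mapped IPv6 addresses, so the IPv4 column can be recomputed from the GID. The sketch below uses Python's standard `ipaddress` module; the function names are made up for this example. The second helper decodes the decimal-byte GID format that perftest prints.

```python
import ipaddress
from typing import Optional


def gid_to_ipv4(gid: str) -> Optional[str]:
    """Return the embedded IPv4 address of an IPv4-mapped GID, else None."""
    mapped = ipaddress.IPv6Address(gid).ipv4_mapped
    return str(mapped) if mapped is not None else None


def perftest_gid_to_ipv4(gid: str) -> str:
    """Decode the last four decimal bytes of a perftest-style GID line."""
    return ".".join(str(int(part)) for part in gid.split(":")[-4:])


print(gid_to_ipv4("0000:0000:0000:0000:0000:ffff:0a0d:a002"))  # 10.13.160.2
print(gid_to_ipv4("fe80:0000:0000:0000:bae9:24ff:fe3a:bdff"))  # None
print(perftest_gid_to_ipv4(
    "00:00:00:00:00:00:00:00:00:00:255:255:10:13:150:02"))     # 10.13.150.2
```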

Link layer

cat /sys/class/infiniband/mlx5_1/ports/1/link_layer
Ethernet

RoCE mode

cma_roce_mode -d mlx5_1 -p 1
RoCE v2

RDMA link state

rdma link show
link mlx5_1/1 state ACTIVE physical_state LINK_UP netdev enp129s0f1np1

Route check

ip route get 10.13.150.2
10.13.150.2 via 10.13.160.1 dev enp129s0f1np1 src 10.13.160.2

InfiniBand device nodes

ls /dev/infiniband/
rdma_cm  umad0  umad1  uverbs0  uverbs1

5) Mellanox QoS / DCB state on both nodes

The following mlnx_qos output was captured on both nodes. The outputs were identical.

node-02

root@ai-compute-node-02:~# mlnx_qos -i enp129s0f1np1
DCBX mode: OS controlled
Priority trust state: dscp
dscp2prio mapping:
        prio:0 dscp:07,06,05,04,03,02,01,00,
        prio:1 dscp:15,14,13,12,11,10,09,08,
        prio:2 dscp:23,22,21,20,19,18,17,16,
        prio:3 dscp:31,30,29,28,27,26,25,24,
        prio:4 dscp:39,38,37,36,35,34,33,32,
        prio:5 dscp:47,46,45,44,43,42,41,40,
        prio:6 dscp:55,54,53,52,51,50,49,48,
        prio:7 dscp:63,62,61,60,59,58,57,56,
Receive buffer size (bytes): 19872,276768,0,0,0,0,0,0,max_buffer_size=2069280
Cable len: 7
PFC configuration:
        priority    0   1   2   3   4   5   6   7
        enabled     0   0   0   1   0   0   0   0
        buffer      0   0   0   1   0   0   0   0
tc: 0 ratelimit: unlimited, tsa: vendor
         priority:  1
tc: 1 ratelimit: unlimited, tsa: vendor
         priority:  0
tc: 2 ratelimit: unlimited, tsa: vendor
         priority:  2
tc: 3 ratelimit: unlimited, tsa: vendor
         priority:  3
tc: 4 ratelimit: unlimited, tsa: vendor
         priority:  4
tc: 5 ratelimit: unlimited, tsa: vendor
         priority:  5
tc: 6 ratelimit: unlimited, tsa: vendor
         priority:  6
tc: 7 ratelimit: unlimited, tsa: vendor
         priority:  7

node-01

admin@ai-compute-node-01:~$ mlnx_qos -i enp129s0f1np1
DCBX mode: OS controlled
Priority trust state: dscp
dscp2prio mapping:
        prio:0 dscp:07,06,05,04,03,02,01,00,
        prio:1 dscp:15,14,13,12,11,10,09,08,
        prio:2 dscp:23,22,21,20,19,18,17,16,
        prio:3 dscp:31,30,29,28,27,26,25,24,
        prio:4 dscp:39,38,37,36,35,34,33,32,
        prio:5 dscp:47,46,45,44,43,42,41,40,
        prio:6 dscp:55,54,53,52,51,50,49,48,
        prio:7 dscp:63,62,61,60,59,58,57,56,
Receive buffer size (bytes): 19872,276768,0,0,0,0,0,0,max_buffer_size=2069280
Cable len: 7
PFC configuration:
        priority    0   1   2   3   4   5   6   7
        enabled     0   0   0   1   0   0   0   0
        buffer      0   0   0   1   0   0   0   0
tc: 0 ratelimit: unlimited, tsa: vendor
         priority:  1
tc: 1 ratelimit: unlimited, tsa: vendor
         priority:  0
tc: 2 ratelimit: unlimited, tsa: vendor
         priority:  2
tc: 3 ratelimit: unlimited, tsa: vendor
         priority:  3
tc: 4 ratelimit: unlimited, tsa: vendor
         priority:  4
tc: 5 ratelimit: unlimited, tsa: vendor
         priority:  5
tc: 6 ratelimit: unlimited, tsa: vendor
         priority:  6
tc: 7 ratelimit: unlimited, tsa: vendor
         priority:  7
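
To double-check that DSCP 26 lands in a PFC-enabled priority on the hosts, the relevant parts of the mlnx_qos output can be parsed mechanically. This is a minimal sketch against the text shown above, not an official parser; `parse_dscp2prio` and `parse_pfc_enabled` are names made up for this example.

```python
import re

# Relevant excerpt of the mlnx_qos output shown above.
MLNX_QOS_SAMPLE = """\
dscp2prio mapping:
        prio:0 dscp:07,06,05,04,03,02,01,00,
        prio:1 dscp:15,14,13,12,11,10,09,08,
        prio:2 dscp:23,22,21,20,19,18,17,16,
        prio:3 dscp:31,30,29,28,27,26,25,24,
        prio:4 dscp:39,38,37,36,35,34,33,32,
        prio:5 dscp:47,46,45,44,43,42,41,40,
        prio:6 dscp:55,54,53,52,51,50,49,48,
        prio:7 dscp:63,62,61,60,59,58,57,56,
PFC configuration:
        priority    0   1   2   3   4   5   6   7
        enabled     0   0   0   1   0   0   0   0
        buffer      0   0   0   1   0   0   0   0
"""


def parse_dscp2prio(text: str) -> dict:
    """Build a {dscp: priority} map from the 'prio:N dscp:...' lines."""
    mapping = {}
    for prio, dscps in re.findall(r"prio:(\d+)\s+dscp:([\d,]+)", text):
        for dscp in dscps.split(","):
            if dscp:  # skip the empty field after the trailing comma
                mapping[int(dscp)] = int(prio)
    return mapping


def parse_pfc_enabled(text: str) -> list:
    """Return one boolean per priority 0..7 from the 'enabled' row."""
    match = re.search(r"^\s*enabled((?:\s+[01]){8})\s*$", text, re.MULTILINE)
    if match is None:
        raise ValueError("no PFC 'enabled' row found")
    return [flag == "1" for flag in match.group(1).split()]


dscp2prio = parse_dscp2prio(MLNX_QOS_SAMPLE)
pfc = parse_pfc_enabled(MLNX_QOS_SAMPLE)
prio = dscp2prio[26]
print(prio, pfc[prio])  # 3 True
```

With the output above, DSCP 26 maps to priority 3 and PFC is enabled on priority 3, so the host side looks consistent with the switch's pfc-cos 3.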

ethtool observations

  • ethtool -S enp129s0f1np1 | egrep -i 'pfc|pause|stopped|drop|disc'
  • Observed result on both nodes during these checks: all queried counters remained 0.
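
For repeatable checks, the same counter filter can be applied to two `ethtool -S` snapshots taken before and after a test, so that only counters that moved are reported. The sketch below is one way to do that. The counter names and values in the sample text are illustrative placeholders, not captured output, and the function names are made up for this example.

```python
import re

# Same filter as the egrep expression used above.
PATTERN = re.compile(r"pfc|pause|stopped|drop|disc", re.IGNORECASE)


def parse_counters(text: str) -> dict:
    """Parse 'name: value' lines and keep only counters matching PATTERN."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit() and PATTERN.search(name):
            counters[name.strip()] = int(value)
    return counters


def changed(before: dict, after: dict) -> dict:
    """Return the delta of every counter whose value differs."""
    return {name: after[name] - before.get(name, 0)
            for name in after if after[name] != before.get(name, 0)}


# Illustrative snapshots (hypothetical names and values).
BEFORE = "NIC statistics:\n     rx_prio3_pause: 0\n     tx_prio3_pause: 0\n     rx_packets: 100\n"
AFTER_QUIET = "NIC statistics:\n     rx_prio3_pause: 0\n     tx_prio3_pause: 0\n     rx_packets: 200\n"
AFTER_PAUSED = "NIC statistics:\n     rx_prio3_pause: 0\n     tx_prio3_pause: 5\n     rx_packets: 200\n"

print(changed(parse_counters(BEFORE), parse_counters(AFTER_QUIET)))   # {}
print(changed(parse_counters(BEFORE), parse_counters(AFTER_PAUSED)))  # {'tx_prio3_pause': 5}
```

An empty result corresponds to the observation above that all queried counters remained 0.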

6) Switch QoS / PFC configuration used

Classification / QoS / PFC configuration

class-map type qos match-any ROCEv2
  match dscp 26

policy-map type qos QOS_CLASSIFICATION
  class ROCEv2
    set qos-group 3

policy-map type network-qos qos_network
  class type network-qos c-8q-nq3
    mtu 9216
    pause pfc-cos 3
  class type network-qos c-8q-nq-default
    mtu 9216

policy-map type queuing QOS_EGRESS_PORT
  class type queuing c-out-8q-q3
    bandwidth remaining percent 50
    random-detect minimum-threshold 950 kbytes
    maximum-threshold 3000 kbytes
    drop-probability 7 weight 0 ecn

interface Ethernet1/29
  priority-flow-control mode on
  mtu 9216
  service-policy type qos input QOS_CLASSIFICATION

interface Ethernet1/31
  priority-flow-control mode on
  mtu 9216
  service-policy type qos input QOS_CLASSIFICATION

Notes from troubleshooting

  • Earlier in troubleshooting, the network-qos MTU for the RoCE class had been 4200; it was later changed to 9216.
  • ECN was later removed in a separate test. The observed ib_write_bw behavior remained unchanged in that test.

7) TOS methods used

Method A: sysfs traffic_class

echo 104 > /sys/class/infiniband/mlx5_1/tc/1/traffic_class

Method B: cma_roce_tos

cma_roce_tos -d mlx5_1 -t 104
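
For reference, this is how the value 104 relates to the switch and NIC settings. The TOS byte carries the DSCP in its upper six bits and the ECN field in its lower two bits. The priority formula below is simply the dscp2prio mapping shown in section 5 (eight consecutive DSCP values per priority), not a general rule. The function names are made up for this example.

```python
def split_tos(tos: int) -> tuple:
    """Split an IPv4 TOS byte into (dscp, ecn)."""
    if not 0 <= tos <= 255:
        raise ValueError("TOS must fit in one byte")
    return tos >> 2, tos & 0x3


def dscp_to_priority(dscp: int) -> int:
    """Priority under the dscp2prio mapping shown by mlnx_qos above."""
    return dscp // 8


dscp, ecn = split_tos(104)
print(dscp, ecn, dscp_to_priority(dscp))  # 26 0 3
```

So TOS 104 corresponds to DSCP 26 with the ECN bits cleared, which is the value the switch class-map ROCEv2 matches and which maps to priority 3 on the NICs.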

8) Test matrix and results

Each test below lists the client and server output together with the switch packet counters.

Test 1: ib_write_bw without -R, TOS via echo = 0

Matrix

TOS via echo   TOS via cma_roce_tos
0              0
0              104

Both combinations produced the same result.


Test 1.1: ib_write_bw -d mlx5_1 -F

Result

  • No error
  • Test completed successfully
  • Switch did not detect packets with DSCP 26

Server output

root@ai-compute-node-01:~# ib_write_bw -d mlx5_1 -F
************************************
_Waiting for client to connect..._
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_1
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON           Lock-free      : OFF
 ibv_wr* API     : ON           Using DDP      : OFF
 CQ Moderation   : 1
 CQE Poll Batch  : 16
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x0570 PSN 0x879b8c RKey 0x060400 VAddr 0x0078bdf184a000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:150:02
 remote address: LID 0000 QPN 0x034f PSN 0x5783f6 RKey 0x23beda VAddr 0x0077d83de80000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:160:02
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MiB/sec]    BW average[MiB/sec]   MsgRate[Mpps]
 65536      5000             186.52             186.35               0.355431
---------------------------------------------------------------------------------------

Client output

root@ai-compute-node-02:~# ib_write_bw -d mlx5_1 -F --report_gbits 10.13.150.2
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_1
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON           Lock-free      : OFF
 ibv_wr* API     : ON           Using DDP      : OFF
 TX depth        : 128
 CQ Moderation   : 1
 CQE Poll Batch  : 16
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x034f PSN 0x5783f6 RKey 0x23beda VAddr 0x0077d83de80000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:160:02
 remote address: LID 0000 QPN 0x0570 PSN 0x879b8c RKey 0x060400 VAddr 0x0078bdf184a000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:150:02
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 65536      5000             186.52             186.35               0.355431

Switch observation

No packets with DSCP 26 detected.

Test 1.2: ib_write_bw -d mlx5_1 -F -a

Result

  • No error
  • Test completed successfully
  • Switch did not detect packets with DSCP 26

Server output

root@ai-compute-node-01:~# ib_write_bw -d mlx5_1 -F -a
************************************
_Waiting for client to connect..._
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_1
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON           Lock-free      : OFF
 ibv_wr* API     : ON           Using DDP      : OFF
 CQ Moderation   : 100
 CQE Poll Batch  : 16
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x0571 PSN 0x67399f RKey 0x060400 VAddr 0x0073eb32dff000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:150:02
 remote address: LID 0000 QPN 0x0350 PSN 0xe3747 RKey 0x23be00 VAddr 0x007bed877ff000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:160:02
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MiB/sec]    BW average[MiB/sec]   MsgRate[Mpps]
 8388608    5000             186.40             186.35               0.002777
---------------------------------------------------------------------------------------

Client output

root@ai-compute-node-02:~# ib_write_bw -d mlx5_1 -F --report_gbits -a 10.13.150.2
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_1
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON           Lock-free      : OFF
 ibv_wr* API     : ON           Using DDP      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 CQE Poll Batch  : 16
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x0350 PSN 0xe3747 RKey 0x23be00 VAddr 0x007bed877ff000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:160:02
 remote address: LID 0000 QPN 0x0571 PSN 0x67399f RKey 0x060400 VAddr 0x0073eb32dff000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:150:02
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 2          5000           0.049467            0.048618            3.038600
 4          5000             0.12               0.12                 3.602183
 8          5000             0.23               0.23                 3.621518
 16         5000             0.46               0.46                 3.601976
 32         5000             0.93               0.93                 3.625831
 64         5000             1.86               1.86                 3.632780
 128        5000             3.71               3.70                 3.615469
 256        5000             7.39               7.38                 3.604389
 512        5000             14.76              14.74                3.598350
 1024       5000             29.37              29.34                3.582038
 2048       5000             58.20              58.13                3.547757
 4096       5000             116.09             115.44               3.522918
 8192       5000             193.09             192.96               2.944301
 16384      5000             188.36             188.17               1.435660
 32768      5000             186.57             186.45               0.711250
 65536      5000             186.33             186.25               0.355243
 131072     5000             185.87             185.84               0.177230
 262144     5000             186.06             185.89               0.088640
 524288     5000             185.86             185.82               0.044302
 1048576    5000             186.36             186.35               0.022215
 2097152    5000             186.39             186.35               0.011107
 4194304    5000             186.36             186.35               0.005554
 8388608    5000             186.40             186.35               0.002777
---------------------------------------------------------------------------------------

Switch observation

No packets with DSCP 26 detected.

Test 2: ib_write_bw without -a, TOS via echo = 104

Matrix

TOS via echo   TOS via cma_roce_tos
104            0
104            104

Both combinations produced the same result.

Result

  • Error
  • Switch did not detect packets with DSCP 26
  • No switch drops observed

Server output

root@ai-compute-node-01:~# ib_write_bw -d mlx5_1 -F
************************************
_Waiting for client to connect..._
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_1
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON           Lock-free      : OFF
 ibv_wr* API     : ON           Using DDP      : OFF
 CQ Moderation   : 1
 CQE Poll Batch  : 16
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x0565 PSN 0x88ffe8 RKey 0x0604d9 VAddr 0x0079c6ff8c2000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:150:02
 remote address: LID 0000 QPN 0x0344 PSN 0xe1bb30 RKey 0x23bed9 VAddr 0x007a486535e000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:160:02
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MiB/sec]    BW average[MiB/sec]   MsgRate[Mpps]
ethernet_read_keys: Couldn't read remote address
 Unable to read to socket/rdma_cm
 Failed to exchange data between server and clients

Client output

root@ai-compute-node-02:~# ib_write_bw -d mlx5_1 -F 10.13.150.2
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_1
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON           Lock-free      : OFF
 ibv_wr* API     : ON           Using DDP      : OFF
 TX depth        : 128
 CQ Moderation   : 1
 CQE Poll Batch  : 16
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x0344 PSN 0xe1bb30 RKey 0x23bed9 VAddr 0x007a486535e000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:160:02
 remote address: LID 0000 QPN 0x0565 PSN 0x88ffe8 RKey 0x0604d9 VAddr 0x0079c6ff8c2000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:150:02
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MiB/sec]    BW average[MiB/sec]   MsgRate[Mpps]
 Completion with error at client
 Failed status 12: wr_id 0 syndrom 0x81
scnt=128, ccnt=0
 Failed to complete run_iter_bw function successfully

Switch observation

No packets with DSCP 26 detected.
No switch drops observed.

Test 3: ib_write_bw with -a, TOS via echo = 104

Matrix

TOS via echo   TOS via cma_roce_tos
104            0
104            104

Both combinations produced the same result.

Result

  • Initial packets are sent
  • Test then fails with the same error pattern as Test 2

Server output

root@ai-compute-node-01:~# ib_write_bw -d mlx5_1 -F -a
************************************
_Waiting for client to connect..._
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_1
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON           Lock-free      : OFF
 ibv_wr* API     : ON           Using DDP      : OFF
 CQ Moderation   : 100
 CQE Poll Batch  : 16
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x0566 PSN 0x7ff1d6 RKey 0x060400 VAddr 0x0076daf61ff000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:150:02
 remote address: LID 0000 QPN 0x0345 PSN 0xd6adbb RKey 0x23be00 VAddr 0x007808bcb2e000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:160:02
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MiB/sec]    BW average[MiB/sec]   MsgRate[Mpps]
ethernet_read_keys: Couldn't read remote address
 Unable to read to socket/rdma_cm
 Failed to exchange data between server and clients

Client output

root@ai-compute-node-02:~# ib_write_bw -d mlx5_1 -F -a 10.13.150.2
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_1
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON           Lock-free      : OFF
 ibv_wr* API     : ON           Using DDP      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 CQE Poll Batch  : 16
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x0345 PSN 0xd6adbb RKey 0x23be00 VAddr 0x007808bcb2e000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:160:02
 remote address: LID 0000 QPN 0x0566 PSN 0x7ff1d6 RKey 0x060400 VAddr 0x0076daf61ff000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:150:02
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MiB/sec]    BW average[MiB/sec]   MsgRate[Mpps]
 2          5000             5.66               5.57                 2.920923
 4          5000             13.55              13.47                3.529919
 8          5000             27.28              27.04                3.544757
 16         5000             54.13              53.96                3.536479
 Completion with error at client
 Failed status 12: wr_id 0 syndrom 0x81
scnt=128, ccnt=0
 Failed to complete run_iter_bw function successfully
root@ai-compute-node-02:~#

Switch observation

Eth1/29
  Ingress: 15834 packets with DSCP 26
  No packet drops
  No queued packets
  No PFC packets
  Egress: 20000 packets

Eth1/31
  Ingress: 20000 packets with DSCP 26
  No packet drops
  No queued packets
  No PFC packets
  Egress: 15834 packets

Additional repeated observation:
  Eth1/31 ingress was always 20000 packets
  Eth1/29 ingress varied between 15000 and 16000 packets, but never 20000
  Each port's ingress count always matched the other port's egress count
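
One possible reading of the 20000 figure, offered as a sanity check rather than a conclusion: the client (node-02, attached to Eth1/31) completed four message sizes of 5000 iterations each before failing, and each of those messages fits in a single packet at the 4096-byte path MTU. The sketch assumes the switch counters were cleared before the run and that only the client's request packets are counted on Eth1/31 ingress.

```python
ITERATIONS = 5000                 # per message size, from the perftest output
COMPLETED_SIZES = [2, 4, 8, 16]   # sizes that finished before the error
PATH_MTU = 4096                   # "Mtu : 4096[B]" in the perftest header


def packets_per_message(size: int, mtu: int) -> int:
    """Number of packets needed to carry one message (ceiling division)."""
    return max(1, -(-size // mtu))


expected = sum(packets_per_message(size, PATH_MTU) * ITERATIONS
               for size in COMPLETED_SIZES)
print(expected)  # 20000
```

Under those assumptions, the 20000 DSCP-26 packets on Eth1/31 ingress match exactly the four completed sizes. The lower count in the reverse direction (Eth1/29 ingress) may simply reflect the responder acknowledging several requests with one ACK.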

Test 4: ib_write_bw using rdma_cm (-R), TOS via echo = 104, cma_roce_tos = 104

Matrix

TOS via echo   TOS via cma_roce_tos
104            104

Result

  • With and without -a: no output
  • Commands had to be interrupted manually
  • Switch did not detect packets with DSCP 26
  • No switch drops observed

Server output

root@ai-compute-node-01:~# ib_write_bw -d mlx5_1 -F -R
************************************
_Waiting for client to connect..._
************************************
^C

Client output

root@ai-compute-node-02:~# ib_write_bw -d mlx5_1 -F -R 10.13.150.2
^C

Switch observation

No packets with DSCP 26 detected.
No switch drops observed.

Test 5: ib_write_bw using rdma_cm (-R), TOS via echo = 104, cma_roce_tos = 0

Matrix

TOS via echo   TOS via cma_roce_tos
104            0

Result

  • Client error
  • Switch observed 8 ingress and 8 egress packets on both ports

Server output

root@ai-compute-node-01:~# ib_write_bw -d mlx5_1 -F -R
************************************
_Waiting for client to connect..._
************************************
^C

Client output

root@ai-compute-node-02:~#  ib_write_bw -d mlx5_1 -F -R 10.13.150.2
 Bad wc status 12
 Unable to write to socket/rdma_cm
Failed to sync between client and server before creating RDMA CM connection.
ERRNO: No such file or directory.
Failed to create RDMA CM connection with resources.

Switch observation

Ingress and egress: 8 packets each on Eth1/29 and Eth1/31

Test 6: ib_write_bw using rdma_cm (-R), TOS = 0

Matrix

TOS via echo   TOS via cma_roce_tos
0              0

Result

  • ib_write_bw -R works without error
  • ib_write_bw -R -a also works without error
  • Switch did not detect packets with DSCP 26
  • No switch drops observed

Server output (-R)

root@ai-compute-node-01:~# ib_write_bw -d mlx5_1 -F -R
************************************
_Waiting for client to connect..._
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_1
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON           Lock-free      : OFF
 ibv_wr* API     : ON           Using DDP      : OFF
 CQ Moderation   : 1
 CQE Poll Batch  : 16
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : ON
 Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
 Waiting for client rdma_cm QP to connect
 Please run the same command with the IB/RoCE interface IP
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x058a PSN 0x30227b
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:150:02
 remote address: LID 0000 QPN 0x0368 PSN 0xf4e3b2
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:160:02
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MiB/sec]    BW average[MiB/sec]   MsgRate[Mpps]
 65536      5000             21588.41            14648.41                    0.234375
---------------------------------------------------------------------------------------

Server output (-R -a)

root@ai-compute-node-01:~# ib_write_bw -d mlx5_1 -F -R -a
************************************
_Waiting for client to connect..._
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_1
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON           Lock-free      : OFF
 ibv_wr* API     : ON           Using DDP      : OFF
 CQ Moderation   : 100
 CQE Poll Batch  : 16
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : ON
 Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
 Waiting for client rdma_cm QP to connect
 Please run the same command with the IB/RoCE interface IP
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x058c PSN 0x5f6b64
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:150:02
 remote address: LID 0000 QPN 0x036a PSN 0xcac767
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:160:02
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MiB/sec]    BW average[MiB/sec]   MsgRate[Mpps]
 8388608    5000             22320.78            22317.14                    0.002790
---------------------------------------------------------------------------------------
root@ai-compute-node-01:~#

Client output (-R)

root@ai-compute-node-02:~# ib_write_bw -d mlx5_1 -F -R 10.13.150.2
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_1
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON           Lock-free      : OFF
 ibv_wr* API     : ON           Using DDP      : OFF
 TX depth        : 128
 CQ Moderation   : 1
 CQE Poll Batch  : 16
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : ON
 Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x0368 PSN 0xf4e3b2
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:160:02
 remote address: LID 0000 QPN 0x058a PSN 0x30227b
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:150:02
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MiB/sec]    BW average[MiB/sec]   MsgRate[Mpps]
 65536      5000             21588.41            14648.41                    0.234375
---------------------------------------------------------------------------------------

Client output (-R -a)

root@ai-compute-node-02:~# ib_write_bw -d mlx5_1 -F -R -a 10.13.150.2
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_1
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON           Lock-free      : OFF
 ibv_wr* API     : ON           Using DDP      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 CQE Poll Batch  : 16
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : ON
 Data ex. method : rdma_cm
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x036a PSN 0xcac767
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:160:02
 remote address: LID 0000 QPN 0x058c PSN 0x5f6b64
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:10:13:150:02
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MiB/sec]    BW average[MiB/sec]   MsgRate[Mpps]
 2          5000             5.01               4.85                 2.542956
 4          5000             13.51              13.51                3.540758
 8          5000             26.96              26.94                3.531064
 16         5000             54.49              54.05                3.542242
 32         5000             108.26             108.12               3.542739
 64         5000             216.51             213.97               3.505765
 128        5000             433.59             431.86               3.537838
 256        5000             863.78             863.41               3.536529
 512        5000             1716.28            1714.21              3.510692
 1024       5000             3423.66            3419.20              3.501260
 2048       5000             4260.16            4224.91              2.163152
 4096       5000             4944.21            4797.44              1.228145
 8192       5000             6064.57            5780.67              0.739926
 16384      5000             17082.16            8420.76                     0.538928
 32768      5000             21077.72            19990.12                    0.639684
 65536      5000             22276.43            22214.67                    0.355435
 131072     5000             22270.59            22249.24                    0.177994
 262144     5000             22280.54            22223.47                    0.088894
 524288     5000             22276.86            22221.16                    0.044442
 1048576    5000             22330.90            22318.07                    0.022318
 2097152    5000             22340.31            22310.08                    0.011155
 4194304    5000             22318.65            22315.21                    0.005579
 8388608    5000             22320.78            22317.14                    0.002790
---------------------------------------------------------------------------------------

Switch observation

No packets with DSCP 26 detected.
No switch drops observed.

Test 7: rping


Test 7.1: rping, TOS via echo = 104, cma_roce_tos = 104

Matrix

TOS via Echo    TOS via cma_roce_tos
104             104

Result

  • Commands had to be interrupted manually
  • Switch did not detect packets with DSCP 26

Server output

root@ai-compute-node-01:~# rping -s -a 10.13.150.2 -v -d
created cm_id 0x63f615a24b10
rdma_bind_addr successful
rdma_listen
^C

Client output

root@ai-compute-node-02:~# rping -c -a 10.13.150.2 -C 10 -v -d
created cm_id 0x6078a8977b10
cma_event type RDMA_CM_EVENT_ADDR_RESOLVED cma_id 0x6078a8977b10 (parent)
cma_event type RDMA_CM_EVENT_ROUTE_RESOLVED cma_id 0x6078a8977b10 (parent)
rdma_resolve_addr - rdma_resolve_route successful
created pd 0x6078a89782b0
created channel 0x6078a89745c0
created cq 0x6078a8978310
created qp 0x6078a897b538
rping_setup_buffers called on cb 0x6078a89747c0
allocated & registered buffers...
cq_thread started.
^C

Switch observation

No packets on ports with DSCP 26.

Test 7.2: rping, TOS via echo = 104, cma_roce_tos = 0

Matrix

TOS via Echo    TOS via cma_roce_tos
104             0

Result

  • Connection reaches ESTABLISHED
  • Then fails with status 12

Server output

root@ai-compute-node-01:~# rping -s -a 10.13.150.2 -v -d
created cm_id 0x62c4b87d9b10
rdma_bind_addr successful
rdma_listen
cma_event type RDMA_CM_EVENT_CONNECT_REQUEST cma_id 0x724ccc000ce0 (child)
child cma 0x724ccc000ce0
created pd 0x62c4b87daa50
created channel 0x62c4b87daab0
created cq 0x62c4b87daad0
created qp 0x62c4b87dadd8
rping_setup_buffers called on cb 0x62c4b87d67c0
allocated & registered buffers...
accepting client connection request
cq_thread started.
cma_event type RDMA_CM_EVENT_ESTABLISHED cma_id 0x724ccc000ce0 (child)
ESTABLISHED
recv completion
Received rkey 23b7b5 addr 58cae5b13340 len 64 from peer
server received sink adv
server posted rdma read req
cq completion failed status 12
wait for RDMA_READ_COMPLETE state 11
rping server failed: -1
rping_free_buffers called on cb 0x62c4b87d67c0
destroy cm_id 0x62c4b87d9b10

Client output

root@ai-compute-node-02:~# rping -c -a 10.13.150.2 -C 10 -v -d
created cm_id 0x58cae5b15b10
cma_event type RDMA_CM_EVENT_ADDR_RESOLVED cma_id 0x58cae5b15b10 (parent)
cma_event type RDMA_CM_EVENT_ROUTE_RESOLVED cma_id 0x58cae5b15b10 (parent)
rdma_resolve_addr - rdma_resolve_route successful
created pd 0x58cae5b162b0
created channel 0x58cae5b125c0
created cq 0x58cae5b16310
created qp 0x58cae5b19538
rping_setup_buffers called on cb 0x58cae5b127c0
allocated & registered buffers...
cq_thread started.
cma_event type RDMA_CM_EVENT_ESTABLISHED cma_id 0x58cae5b15b10 (parent)
ESTABLISHED
rdma_connect successful
RDMA addr 58cae5b13340 rkey 23b7b5 len 64
send completion
cma_event type RDMA_CM_EVENT_DISCONNECTED cma_id 0x58cae5b15b10 (parent)
client DISCONNECT EVENT...
wait for RDMA_WRITE_ADV state 10
rping_free_buffers called on cb 0x58cae5b127c0
destroy cm_id 0x58cae5b15b10

Switch observation

Eth1/29
  Ingress: 14 packets with DSCP 26
  No packet drops
  No queued packets
  No PFC packets
  Egress: 1 packet

Eth1/31
  Ingress: 1 packet with DSCP 26
  No packet drops
  No queued packets
  No PFC packets
  Egress: 14 packets
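
For anyone decoding the error: as far as I can tell, the "status 12" printed by rping is the numeric value of a libibverbs work completion status (enum ibv_wc_status), where 12 is IBV_WC_RETRY_EXC_ERR. That is the same condition ibv_rc_pingpong reports as "transport retry counter exceeded". The lookup below is a small sketch covering only a subset of the enum; the class and function names are my own.

```python
from enum import IntEnum


class WcStatus(IntEnum):
    """Subset of enum ibv_wc_status from libibverbs (numeric values as in verbs.h)."""
    SUCCESS = 0
    WR_FLUSH_ERR = 5
    REM_ACCESS_ERR = 10
    REM_OP_ERR = 11
    RETRY_EXC_ERR = 12       # requester exhausted its transport retries
    RNR_RETRY_EXC_ERR = 13   # receiver-not-ready retries exhausted


def describe(status: int) -> str:
    """Return the IBV_WC_* name for a status code, if it is in the subset above."""
    try:
        return f"IBV_WC_{WcStatus(status).name}"
    except ValueError:
        return f"status {status} (not in this subset)"


print(describe(12))
```

In this test the error is raised on the server side right after "server posted rdma read req", so it is the server's RDMA read towards the client that ran out of retries.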

Test 7.3: rping, TOS = 0

Matrix

TOS via Echo    TOS via cma_roce_tos
0               0

Result

  • rping successful
  • Switch did not detect packets with DSCP 26

Server output

root@ai-compute-node-01:~# rping -s -a 10.13.150.2 -v -d
created cm_id 0x65485ce1eb10
rdma_bind_addr successful
rdma_listen
cma_event type RDMA_CM_EVENT_CONNECT_REQUEST cma_id 0x786be4000ce0 (child)
child cma 0x786be4000ce0
created pd 0x65485ce1fa50
created channel 0x65485ce1fab0
created cq 0x65485ce1fad0
created qp 0x65485ce1fdd8
rping_setup_buffers called on cb 0x65485ce1b7c0
allocated & registered buffers...
accepting client connection request
cq_thread started.
cma_event type RDMA_CM_EVENT_ESTABLISHED cma_id 0x786be4000ce0 (child)
recv completion
Received rkey 23bdbb addr 5990ec974340 len 64 from peer
ESTABLISHED
server received sink adv
server posted rdma read req
rdma read completion
server received read complete
server ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
server posted go ahead
send completion
[…]
cma_event type RDMA_CM_EVENT_DISCONNECTED cma_id 0x786be4000ce0 (child)
server DISCONNECT EVENT...
wait for RDMA_READ_ADV state 10
rping_free_buffers called on cb 0x65485ce1b7c0
destroy cm_id 0x65485ce1eb10

Client output

root@ai-compute-node-02:~# rping -c -a 10.13.150.2 -C 10 -v -d
created cm_id 0x5990ec976b10
cma_event type RDMA_CM_EVENT_ADDR_RESOLVED cma_id 0x5990ec976b10 (parent)
cma_event type RDMA_CM_EVENT_ROUTE_RESOLVED cma_id 0x5990ec976b10 (parent)
rdma_resolve_addr - rdma_resolve_route successful
created pd 0x5990ec9772b0
created channel 0x5990ec9735c0
created cq 0x5990ec977310
created qp 0x5990ec97a538
rping_setup_buffers called on cb 0x5990ec9737c0
allocated & registered buffers...
cq_thread started.
cma_event type RDMA_CM_EVENT_ESTABLISHED cma_id 0x5990ec976b10 (parent)
ESTABLISHED
rdma_connect successful
RDMA addr 5990ec974340 rkey 23bdbb len 64
send completion
recv completion
RDMA addr 5990ec974520 rkey 23bbb9 len 64
send completion
recv completion
ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
[…]
rping_free_buffers called on cb 0x5990ec9737c0
cma_event type RDMA_CM_EVENT_DISCONNECTED cma_id 0x5990ec976b10 (parent)
client DISCONNECT EVENT...
destroy cm_id 0x5990ec976b10

Switch observation

No packets with DSCP 26 detected.

9) Switch counter observations that remained consistent across tests

Observed in switch outputs

  • Packets that were counted on ingress policy maps as DSCP 26 were also observed on the opposite side egress counters.
  • In the focused 8-packet snapshots:
    • Ethernet1/29 input policy: 8 DSCP-26 packets
    • Ethernet1/31 input policy: 8 DSCP-26 packets
    • QoS group 3 egress on both ports: 8 packets
  • In the ib_write_bw -F -a test with echo 104:
    • Eth1/31 ingress = 20000
    • Eth1/29 egress = 20000
    • Eth1/29 ingress = 15834
    • Eth1/31 egress = 15834
  • Eth1/31 ingress remained at 20000 in repeated runs of that test.
  • Eth1/29 ingress varied between 15000 and 16000 and did not reach 20000.
  • In those observations, the ingress count on each switch port matched the egress count on the other port.
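
The ib_write_bw counter values listed above can be cross-checked with a few lines of arithmetic. This is a sketch that only restates the numbers from the list; the dictionary layout and the helper name are illustrative, and it can say nothing about packets that were never classified as DSCP 26.

```python
# DSCP-26 packet counts copied from the ib_write_bw -F -a run with echo 104.
COUNTERS = {
    ("Eth1/31", "ingress"): 20000,
    ("Eth1/29", "egress"): 20000,
    ("Eth1/29", "ingress"): 15834,
    ("Eth1/31", "egress"): 15834,
}


def forwarding_gap(counters, in_port, out_port):
    """Packets counted on in_port ingress that are missing from out_port egress."""
    return counters[(in_port, "ingress")] - counters[(out_port, "egress")]


gap_31_to_29 = forwarding_gap(COUNTERS, "Eth1/31", "Eth1/29")
gap_29_to_31 = forwarding_gap(COUNTERS, "Eth1/29", "Eth1/31")

# Difference between what the two hosts delivered to the switch.
host_asymmetry = COUNTERS[("Eth1/31", "ingress")] - COUNTERS[("Eth1/29", "ingress")]

print(f"Eth1/31 -> Eth1/29 gap: {gap_31_to_29}")
print(f"Eth1/29 -> Eth1/31 gap: {gap_29_to_31}")
print(f"ingress asymmetry between ports: {host_asymmetry}")
```

Both forwarding gaps are 0, so every packet the switch counted as DSCP 26 on ingress also left on the other port. The 4166-packet difference is between what arrived from node-02 (Eth1/31) and what arrived from node-01 (Eth1/29), not a loss that these switch counters can attribute to the switch itself.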

Observed queue / congestion counters

In the relevant switch outputs provided during troubleshooting:

  • Queue drops: 0
  • WRED/AFD drops: 0
  • ECN packets: 0
  • Queue depth: 0
  • PFC packets / pause counters: 0

11) Summary of observed behavior only

When TOS / traffic class is 0

Observed successful cases:

  • ib_write_bw -F
  • ib_write_bw -F -a
  • ib_write_bw -F -R
  • ib_write_bw -F -R -a
  • rping

When echo 104 > /sys/class/infiniband/mlx5_1/tc/1/traffic_class is used

Observed outcomes:

  • ib_write_bw -F fails
  • ib_write_bw -F -a starts, then fails after initial sizes
  • ib_write_bw -F -R with cma_roce_tos = 0 fails
  • rping with cma_roce_tos = 0 reaches ESTABLISHED and then fails with status 12

When cma_roce_tos -d mlx5_1 -t 104 is used

Observed outcomes in the listed tests:

  • In the tests above, the switch did not show DSCP 26 matches for the cases where only cma_roce_tos was relied on.
  • ib_write_bw -F -R with echo 104 and cma_roce_tos 104 produced no output and had to be interrupted.
  • rping with echo 104 and cma_roce_tos 104 also had to be interrupted.
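
For reference, the TOS value 104 and the DSCP value 26 used throughout these tests are the same setting seen at two different widths: DSCP is the upper 6 bits of the 8-bit TOS / traffic class byte, and the lower 2 bits are ECN. The sketch below shows the conversion; the function names are my own and not part of any Mellanox or Cisco tool.

```python
from typing import Tuple


def split_tos(tos: int) -> Tuple[int, int]:
    """Split an 8-bit TOS / traffic class byte into (DSCP, ECN)."""
    if not 0 <= tos <= 0xFF:
        raise ValueError("TOS must fit in one byte")
    return tos >> 2, tos & 0x3


def dscp_to_tos(dscp: int, ecn: int = 0) -> int:
    """Build the TOS byte from a 6-bit DSCP and a 2-bit ECN field."""
    if not 0 <= dscp <= 0x3F or not 0 <= ecn <= 0x3:
        raise ValueError("DSCP is 6 bits, ECN is 2 bits")
    return (dscp << 2) | ecn


dscp, ecn = split_tos(104)
print(f"TOS 104 -> DSCP {dscp}, ECN {ecn}")
print(f"DSCP 26 -> TOS {dscp_to_tos(26)}")
```

So a value of 104 written to traffic_class or passed to cma_roce_tos should show up on the switch as DSCP 26 with the ECN bits cleared, which is what the ingress policy maps match on.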

12) What I am looking for

I am looking for interpretation of the facts above, especially from anyone familiar with:

  • Mellanox ConnectX RoCEv2 behavior when non-zero traffic class is set
  • Cisco Nexus 9000 QoS / DSCP classification / queue counter interpretation for RoCEv2
  • Differences between:
    • echo <value> > /sys/class/infiniband/mlx5_1/tc/1/traffic_class
    • cma_roce_tos -d mlx5_1 -t <value>
  • Why the test behavior changes depending on TOS setting method and command mode (-R vs no -R)

Hi @dominik.souard,

Thank you for posting your detailed analysis on the NVIDIA Community.

As a first step, we recommend verifying that your end‑to‑end software and hardware stack is a supported combination for RoCE. In particular, please cross‑check:

  • OS and kernel version

  • RDMA stack (MLNX_OFED vs inbox drivers)

  • ConnectX‑7 firmware version

  • Switch OS version

  • Optics/cables and link speed

against the relevant product release notes and compatibility / support matrix for your platform.

Once the stack is confirmed supported, please review your configuration against our RoCE documentation and reference examples.

These guides describe the recommended RoCEv2 and PFC settings on hosts and switches (including DSCP/traffic‑class mapping, MTU, lossless queue configuration, and RDMA‑CM ToS/DSCP behavior) and should help you validate that your current setup aligns with the documented procedures.

Regarding the specific points you raised:

  • Mellanox/NVIDIA ConnectX RoCEv2 behavior with non‑zero traffic class:
    In general, setting a non‑zero traffic class (or ToS/DSCP) is used to classify RoCEv2 flows into a specific priority/queue so that PFC and ECN policies can be applied as documented. The RoCE documentation above, together with the DSCP article, describes how ToS/DSCP is interpreted on the NIC side and how it should be mapped in the network for lossless RoCE.

  • Cisco Nexus 9000 QoS / DSCP classification / queue counters for RoCEv2:
    From the NVIDIA side, the RoCE deployment guides show the expected DSCP values (for example, DSCP 26) and how they are typically mapped to a lossless priority/queue in the network. The exact meaning of Nexus QoS/queue counters, and how DSCP is classified internally on a specific NX‑OS release, is implementation‑specific and should be confirmed against Cisco’s official QoS and RoCE/DCF documentation or with Cisco support.

  • Difference between echo <value> > /sys/class/infiniband/mlx5_1/tc/1/traffic_class and cma_roce_tos -d mlx5_1 -t <value> / behavior with -R:
    In broad terms, cma_roce_tos controls the ToS/DSCP used by RDMA-CM managed QPs (for example, when benchmarks or applications use the RDMA connection manager, such as with ib_* tools in -R mode). The sysfs traffic_class interface is used as the default traffic class for QPs that are created without RDMA-CM (for example, applications that open QPs directly via verbs). Both mechanisms ultimately program the same traffic-class/ToS field, but they apply to different QP creation paths, which is why tests that use rdma_cm (-R) and tests that open QPs directly (no -R) can legitimately show different behavior when only one of these mechanisms is configured.

If you still experience issues after aligning your environment with the supported software matrix and the configuration in the documentation above, a valid support entitlement for the HCA in use will be needed for additional troubleshooting.
If there is an active entitlement/support contract in place, please do not hesitate to open a support ticket by logging into the ESP Portal and submitting a new case.
For contracts, please reach out to Networking-Contracts@nvidia.com.

Thanks,
NVEX Networking Technical Support Team