Confused local protection error

xiangyu · February 1, 2020, 7:07am

There is one RoCE NIC in my workstation, and it has two ports, so I see two interfaces in the output of ifconfig, as shown below:

rdma2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 10.0.24.2 netmask 255.255.255.0 broadcast 10.0.24.255
        inet6 fe80::526b:4bff:fed3:d574 prefixlen 64 scopeid 0x20
        ether 50:6b:4b:d3:d5:74 txqueuelen 1000 (Ethernet)
        RX packets 15 bytes 2310 (2.2 KiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 577 bytes 35360 (34.5 KiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

rdma3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 10.0.24.3 netmask 255.255.255.0 broadcast 10.0.24.255
        inet6 fe80::526b:4bff:fed3:d575 prefixlen 64 scopeid 0x20
        ether 50:6b:4b:d3:d5:75 txqueuelen 1000 (Ethernet)
        RX packets 574 bytes 35568 (34.7 KiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 23 bytes 2966 (2.8 KiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

Today I ran distributed TensorFlow with two workers. Each worker uses the libibverbs API (TensorFlow GDR) to enable RDMA for communication. One worker binds to 10.0.24.2:8000 and the other binds to 10.0.24.3:9000. During training, an error occurs in the worker bound to 10.0.24.3:9000:

mlx5: gpu5.maas: got completion with error:
00000000 00000000 00000000 00000000
00000000 93003204 1000010f 000096d2

But if I let the two workers bind to the same IP address with different ports, for example one worker on 10.0.24.2:8000 and the other on 10.0.24.2:9000, the problem does not appear.

I am very confused by this problem. How can I debug it?
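As a starting point for debugging, the final line of the hex dump in the error message can be decoded by hand. The sketch below assumes that line is the last 16 bytes of a 64-byte mlx5 error CQE, laid out as in the `mlx5_err_cqe` struct of rdma-core's mlx5 provider, and printed as big-endian 32-bit words. The syndrome table is a subset quoted from memory of those headers, and the function name `decode_err_cqe_tail` is hypothetical, so verify both against the rdma-core version you have installed.

```python
import struct

# Subset of mlx5 CQE syndrome codes (assumed from rdma-core's mlx5 headers;
# check your installed version before relying on them).
SYNDROMES = {
    0x01: "local length error",
    0x02: "local QP operation error",
    0x04: "local protection error",
    0x05: "work request flushed error",
    0x06: "memory window bind error",
    0x10: "bad response error",
    0x11: "local access error",
    0x12: "remote invalid request error",
    0x13: "remote access error",
    0x14: "remote operation error",
    0x15: "transport retry counter exceeded",
    0x16: "RNR retry counter exceeded",
    0x22: "remote aborted error",
}

# CQE opcodes for error completions (assumed: 13 = requester, 14 = responder).
ERROR_SIDES = {13: "requester", 14: "responder"}


def decode_err_cqe_tail(line):
    """Decode the last 16 bytes of an mlx5 error CQE given as four hex words."""
    raw = bytes.fromhex(line.replace(" ", ""))
    if len(raw) != 16:
        raise ValueError("expected four 32-bit hex words")
    vendor_syndrome = raw[6]                        # CQE byte 54
    syndrome = raw[7]                               # CQE byte 55
    (opcode_qpn,) = struct.unpack(">I", raw[8:12])  # CQE bytes 56..59
    (wqe_counter,) = struct.unpack(">H", raw[12:14])
    op_own = raw[15]                                # CQE byte 63
    return {
        "vendor_syndrome": vendor_syndrome,
        "syndrome": syndrome,
        "syndrome_name": SYNDROMES.get(syndrome, "unknown"),
        "wqe_opcode": opcode_qpn >> 24,
        "qpn": opcode_qpn & 0xFFFFFF,
        "wqe_counter": wqe_counter,
        "side": ERROR_SIDES.get(op_own >> 4, "unknown"),
    }


if __name__ == "__main__":
    for key, value in decode_err_cqe_tail("00000000 93003204 1000010f 000096d2").items():
        print(key, value)
```

Under these assumptions the syndrome byte is 0x04, which matches the local protection error in the topic title, and the completion is reported on the requester side for QP number 0x10f.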

alekseys1 · February 18, 2020, 4:09pm

Hi,

Errors of this kind are almost always caused by the application. You should check that access rights, memory, device port numbers, send and receive queue sizes, MTU, and everything else are allocated correctly.
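To make the access-rights and memory part of that checklist concrete, here is a minimal model of the condition a local protection error usually points to: a scatter/gather entry whose lkey, address range, or access flags do not match a registered memory region. This is plain Python, not a binding to libibverbs. `Region`, `Sge`, and `check_sge` are hypothetical names for illustration, and only the flag value mirrors `IBV_ACCESS_LOCAL_WRITE`.

```python
from dataclasses import dataclass

LOCAL_WRITE = 1  # same numeric value as IBV_ACCESS_LOCAL_WRITE


@dataclass
class Region:
    """Hypothetical stand-in for a registered memory region (ibv_mr)."""
    addr: int
    length: int
    lkey: int
    access: int


@dataclass
class Sge:
    """Hypothetical stand-in for a scatter/gather entry (ibv_sge)."""
    addr: int
    length: int
    lkey: int


def check_sge(sge, regions, nic_writes_locally):
    """Return a list of problems found for one scatter/gather entry."""
    matches = [r for r in regions if r.lkey == sge.lkey]
    if not matches:
        return ["lkey 0x%x does not belong to any registered region" % sge.lkey]
    region = matches[0]
    problems = []
    # The buffer must lie entirely inside the registered range.
    if sge.addr < region.addr or sge.addr + sge.length > region.addr + region.length:
        problems.append("buffer falls outside the registered range")
    # Receive buffers and RDMA read destinations are written by the NIC,
    # so the region needs local write access.
    if nic_writes_locally and not region.access & LOCAL_WRITE:
        problems.append("region lacks local write access")
    return problems


if __name__ == "__main__":
    regions = [Region(addr=0x1000, length=4096, lkey=0x42, access=0)]
    print(check_sge(Sge(addr=0x1F00, length=512, lkey=0x42), regions, True))
```

In a real application the same questions apply to every posted work request: was the buffer registered on the protection domain of the device that owns the QP, does the lkey come from that registration, and does the registration allow the access the operation needs.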

Another point: check whether using different IP subnets on the interfaces helps. You are using the same subnet, 10.0.24.0/24, on both ports. Something in the routing tables may be confusing RDMA CM.
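The subnet overlap is easy to confirm programmatically. The sketch below uses Python's standard `ipaddress` module with the addresses and netmask from the ifconfig output in the question. The helper name `shared_subnets` is hypothetical.

```python
import ipaddress


def shared_subnets(interfaces):
    """Return the networks that contain more than one of the given interfaces."""
    by_network = {}
    for name, cidr in interfaces.items():
        network = ipaddress.ip_interface(cidr).network
        by_network.setdefault(network, []).append(name)
    return {str(net): names for net, names in by_network.items() if len(names) > 1}


if __name__ == "__main__":
    # Addresses taken from the ifconfig output; netmask 255.255.255.0 is /24.
    ports = {"rdma2": "10.0.24.2/24", "rdma3": "10.0.24.3/24"}
    print(shared_subnets(ports))  # {'10.0.24.0/24': ['rdma2', 'rdma3']}
```

If the result is non-empty, traffic for that subnet may leave through either port depending on the routing table, which is the situation worth ruling out by moving one port to a different subnet.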
