Confused by a local protection error

There is one RoCE NIC in my workstation. It has two ports, so I can see two interfaces in the output of ifconfig, as shown below:

rdma2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.0.24.2  netmask 255.255.255.0  broadcast 10.0.24.255
        inet6 fe80::526b:4bff:fed3:d574  prefixlen 64  scopeid 0x20<link>
        ether 50:6b:4b:d3:d5:74  txqueuelen 1000  (Ethernet)
        RX packets 15  bytes 2310 (2.2 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 577  bytes 35360 (34.5 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

rdma3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.0.24.3  netmask 255.255.255.0  broadcast 10.0.24.255
        inet6 fe80::526b:4bff:fed3:d575  prefixlen 64  scopeid 0x20<link>
        ether 50:6b:4b:d3:d5:75  txqueuelen 1000  (Ethernet)
        RX packets 574  bytes 35568 (34.7 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 23  bytes 2966 (2.8 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

Today I ran distributed TensorFlow with two workers; each worker uses the libibverbs API (TensorFlow GDR) to enable RDMA for communication. One worker binds the address 10.0.24.2:8000 and the other binds 10.0.24.3:9000. During training, an error occurs in the worker bound to 10.0.24.3:9000:

mlx5: gpu5.maas: got completion with error:

00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 93003204 1000010f 000096d2
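
That raw dump is all mlx5 prints. For reference, here is a minimal sketch (hypothetical variable names, not the actual TensorFlow GDR code) of how the failed work completion itself can be read back through libibverbs so the status comes out as a name instead of hex:

#include <stdio.h>
#include <infiniband/verbs.h>

/* Poll one completion from the worker's CQ and print a readable status. */
static void check_one_completion(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int n = ibv_poll_cq(cq, 1, &wc);

    if (n < 0) {
        fprintf(stderr, "ibv_poll_cq failed\n");
        return;
    }
    if (n == 0)
        return;                       /* nothing completed yet */

    if (wc.status != IBV_WC_SUCCESS) {
        /* For this issue the status is expected to be IBV_WC_LOC_PROT_ERR,
         * which usually points at a bad lkey, length, or access flags on a
         * scatter/gather entry of the failed work request. */
        fprintf(stderr, "wr_id %llu failed: %s (status %d, vendor_err 0x%x)\n",
                (unsigned long long)wc.wr_id,
                ibv_wc_status_str(wc.status), wc.status, wc.vendor_err);
    }
}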

But if I let the two workers bind to the same IP address with different ports, for example one worker at 10.0.24.2:8000 and the other at 10.0.24.2:9000, the problem does not appear.

I am very confused by this problem. How can I debug it?

Hi,

These kinds of errors are almost always caused by the application. You should check that access rights, memory registration, device port numbers, send/receive queue sizes, MTU, and so on are all set up correctly.
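
In particular, a local protection error usually means that a buffer referenced by a work request was not registered, was registered with too small a length, or is missing the needed access flags. A rough sketch of the registration side (function names from libibverbs, variable names hypothetical):

#include <stdio.h>
#include <infiniband/verbs.h>

/* Register a buffer so the HCA may read and write it.  A missing
 * IBV_ACCESS_LOCAL_WRITE, a too-short length, or a wrong lkey in a later
 * scatter/gather entry are the usual causes of IBV_WC_LOC_PROT_ERR. */
static struct ibv_mr *register_buffer(struct ibv_pd *pd, void *buf, size_t len)
{
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr)
        perror("ibv_reg_mr");   /* e.g. locked-memory limit too low */
    return mr;
}

Every scatter/gather entry posted later must use mr->lkey and stay inside the registered range [buf, buf + len).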

Another point: check whether using different IP subnets on the two interfaces helps. You are using the same subnet, 10.0.24.0/24, on both ports. It could be something in the routing tables that confuses RDMA CM.
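
If the connection setup goes through librdmacm, another thing worth trying (a sketch only; it assumes you can reach the connection code, and the addresses and names below are placeholders taken from this thread) is to pass the intended local IP explicitly when resolving the peer, so the CM cannot pick the other port that sits on the same subnet:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <rdma/rdma_cma.h>

/* Resolve the peer while pinning the local side to 10.0.24.3, so RDMA CM
 * does not route the connection through the other port on the same subnet. */
static int resolve_with_source(struct rdma_cm_id *id)
{
    struct sockaddr_in src = { .sin_family = AF_INET };
    struct sockaddr_in dst = { .sin_family = AF_INET,
                               .sin_port   = htons(8000) };

    inet_pton(AF_INET, "10.0.24.3", &src.sin_addr);   /* local interface */
    inet_pton(AF_INET, "10.0.24.2", &dst.sin_addr);   /* remote worker   */

    return rdma_resolve_addr(id, (struct sockaddr *)&src,
                             (struct sockaddr *)&dst, 2000 /* ms timeout */);
}

If that code path is not reachable, moving the two ports onto different subnets achieves the same effect through the normal routing lookup.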