There is one RoCE NIC in my workstation. It has two ports, so ifconfig shows two interfaces:
rdma2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 10.0.24.2 netmask 255.255.255.0 broadcast 10.0.24.255
        inet6 fe80::526b:4bff:fed3:d574 prefixlen 64 scopeid 0x20
        ether 50:6b:4b:d3:d5:74 txqueuelen 1000 (Ethernet)
        RX packets 15 bytes 2310 (2.2 KiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 577 bytes 35360 (34.5 KiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

rdma3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 10.0.24.3 netmask 255.255.255.0 broadcast 10.0.24.255
        inet6 fe80::526b:4bff:fed3:d575 prefixlen 64 scopeid 0x20
        ether 50:6b:4b:d3:d5:75 txqueuelen 1000 (Ethernet)
        RX packets 574 bytes 35568 (34.7 KiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 23 bytes 2966 (2.8 KiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
Today I ran distributed TensorFlow with two workers. Each worker uses the libibverbs API (TensorFlow GDR) to enable RDMA for communication. One worker binds to 10.0.24.2:8000 and the other binds to 10.0.24.3:9000. During training, the following error occurs in the worker bound to 10.0.24.3:9000:
mlx5: gpu5.maas: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 93003204 1000010f 000096d2
But if I let both workers bind to the same IP address with different ports, for example one worker on 10.0.24.2:8000 and the other on 10.0.24.2:9000, the error does not appear.
I am very confused by this problem. How can I debug it?
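
As a starting point for debugging, here is a minimal sketch of how the completion dump above might be decoded. It assumes the 64 bytes follow the error CQE layout in rdma-core's mlx5 provider (struct mlx5_err_cqe) and that the words are printed in big-endian order. The byte offsets, the syndrome table, and the helper name decode_err_cqe are assumptions made for illustration, not a documented interface.

```python
# Assumed layout (from rdma-core's struct mlx5_err_cqe, not a specification):
#   byte 54      vendor error syndrome
#   byte 55      syndrome
#   bytes 56-59  WQE opcode (top byte) and QP number (low 24 bits)
#   bytes 60-61  WQE counter
#   byte 63      op_own (CQE opcode in the high nibble)

# Partial syndrome table; these names are assumed to mirror the
# MLX5_CQE_SYNDROME_* constants in rdma-core.
SYNDROMES = {
    0x01: "local length error",
    0x02: "local QP operation error",
    0x04: "local protection error",
    0x05: "work request flushed",
    0x13: "remote access error",
    0x15: "transport retry counter exceeded",
}

# The dump exactly as printed by the failing worker.
DUMP = """
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 93003204 1000010f 000096d2
"""


def decode_err_cqe(dump):
    """Decode a 64-byte mlx5 error CQE given as hex words (hypothetical helper)."""
    raw = bytes.fromhex("".join(dump.split()))
    if len(raw) != 64:
        raise ValueError("expected 64 bytes, got %d" % len(raw))
    syndrome = raw[55]
    return {
        "vendor_syndrome": raw[54],
        "syndrome": syndrome,
        "syndrome_name": SYNDROMES.get(syndrome, "unknown"),
        "qpn": int.from_bytes(raw[56:60], "big") & 0xFFFFFF,
        "wqe_counter": int.from_bytes(raw[60:62], "big"),
        "cqe_opcode": raw[63] >> 4,
    }


if __name__ == "__main__":
    for key, value in decode_err_cqe(DUMP).items():
        print(key, value)
```

If that layout holds, the syndrome byte here is 0x04, which would point at a memory registration or protection domain mismatch on the failing worker rather than at a link or routing failure. That is only a reading of the dump under the stated assumptions.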