Trouble using ConnectX-5 Ex cards in host chaining mode. Some connections work but not others.

I have 3 servers, each with a dual-port ConnectX-5 Ex card. I have them connected in a chain like this: S1 → S2 → S3. Server 2 (S2) uses both ports and has (by default) host chaining mode set to BASIC(1). I assigned IP addresses to all the ports and tried to ping the other two servers from each server, and I found that some combinations don’t work.

S1 can ping S2 and S3 correctly (which suggests host chaining is working).

S2 can ping S1, but pings to S3 fail

S3 can ping S1, but pings to S2 fail

I then tried a different topology: S1 → S3 → S2.

S1 can ping S2 but pings to S3 fail

S2 can ping S1 and S3

S3 can ping S2 but pings to S1 fail
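
To make the pattern easier to see, here is a small Python sketch that tabulates the ping results listed above and reports which server pairs fail. The dictionary layout and the function name are only my own illustration; nothing here comes from a Mellanox tool.

```python
# (source, destination) -> did the ping succeed, as reported for each layout
CHAIN_S1_S2_S3 = {
    ("S1", "S2"): True, ("S1", "S3"): True,
    ("S2", "S1"): True, ("S2", "S3"): False,
    ("S3", "S1"): True, ("S3", "S2"): False,
}
CHAIN_S1_S3_S2 = {
    ("S1", "S2"): True, ("S1", "S3"): False,
    ("S2", "S1"): True, ("S2", "S3"): True,
    ("S3", "S2"): True, ("S3", "S1"): False,
}


def failing_pairs(results):
    """Return the unordered server pairs that have at least one failing direction."""
    return {tuple(sorted(pair)) for pair, ok in results.items() if not ok}


print(failing_pairs(CHAIN_S1_S2_S3))  # {('S2', 'S3')}
print(failing_pairs(CHAIN_S1_S3_S2))  # {('S1', 'S3')}
```

In both layouts exactly one pair fails, it fails in both directions, and it always involves the server in the middle of the chain (S2 in the first layout, S3 in the second).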

Any thoughts as to what may be happening? Or how to get more information to fix this issue? It seems weird that in the first experiment S3 responds to pings from S1 but not from S2 on the same port.
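
In case it helps anyone reproduce this, the checks can be scripted roughly like the sketch below and run on each server in turn. It assumes a Linux host with the iputils `ping` command (`-c` for count, `-W` for timeout in seconds); the function names are hypothetical, and any addresses you pass in are your own.

```python
import subprocess
from typing import Callable, Dict


def ping_once(address: str, timeout_s: int = 1) -> bool:
    """Send one ICMP echo request; True if a reply arrived (Linux iputils flags)."""
    completed = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return completed.returncode == 0


def check_peers(
    peers: Dict[str, str], probe: Callable[[str], bool] = ping_once
) -> Dict[str, bool]:
    """Probe every peer address and map peer name -> reachable."""
    return {name: probe(address) for name, address in peers.items()}


def format_report(local_name: str, results: Dict[str, bool]) -> str:
    """Render one 'local -> peer: ok/FAIL' line per peer, sorted by peer name."""
    return "\n".join(
        f"{local_name} -> {peer}: {'ok' if reachable else 'FAIL'}"
        for peer, reachable in sorted(results.items())
    )
```

On S1, for example, `print(format_report("S1", check_peers({"S2": "192.0.2.2", "S3": "192.0.2.3"})))` would print one line per peer. The 192.0.2.x addresses are placeholders, not the ones I actually assigned.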

I appreciate the help.

Hello Rafael,

Thank you for posting your inquiry on the NVIDIA/Mellanox Community.

Based on your information, we noticed that you have a valid support contract. It is therefore more appropriate to assist you further through a support ticket.

You will receive a notification from your new support ticket shortly.

Thank you,

~NVIDIA/Mellanox Technical Support.

Hey, did you ever get this resolved? I believe I’m having the exact same issue. I have three ConnectX-5 En cards, each one in a separate computer, and I can communicate between them as you’ve described, but not directly between two of the nodes.

For anyone who might be having a similar issue, I was able to avoid it by connecting all three nodes in a ring topology. That is, set host chaining to true for each card via mstconfig, as mentioned above, and then cable all three in a ring: S1 → S2 (right port on both), S2 → S3 (left port on S2, right port on S3) and S3 → S1 (left port on both). The port selection may not matter, but I figured I’d include it just in case. I’m not sure why they can’t ping directly, but having them chained this way works fine without any noticeable lag in my applications.

For reference, I’m using three ConnectX-5 En cards (MCX516A-CCAT) in three Ubuntu 20.04 machines with motherboards and CPUs of varying ages. For the En cards, I used mlnx-en-5.9-0.5.6.0-ubuntu20.04-x86_64.iso. You can find instructions here: https://docs.nvidia.com/networking/display/MLNXOFEDv461000/Installing+Mellanox+OFED
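
If you want to double-check a cabling plan before plugging things in, the ring I described can be written down and sanity-checked with a short Python sketch like this one. The data layout and the `is_ring` helper are just my own illustration: it checks that no port is used twice, that every server uses exactly two ports, and that the cables form one closed loop.

```python
# Each cable joins two (server, port) endpoints, as in the ring described above.
RING_CABLES = [
    (("S1", "right"), ("S2", "right")),
    (("S2", "left"), ("S3", "right")),
    (("S3", "left"), ("S1", "left")),
]


def is_ring(cables):
    """True if the cables form a single closed loop with each port used once."""
    endpoints = [end for cable in cables for end in cable]
    if len(set(endpoints)) != len(endpoints):
        return False  # some port appears in more than one cable

    neighbours = {}
    for (server_a, _), (server_b, _) in cables:
        if server_a == server_b:
            return False  # a cable looping back to the same server
        neighbours.setdefault(server_a, []).append(server_b)
        neighbours.setdefault(server_b, []).append(server_a)

    if not neighbours or any(len(n) != 2 for n in neighbours.values()):
        return False  # every server must have exactly two links

    # With two links per server, being connected means a single cycle.
    seen, stack = set(), [next(iter(neighbours))]
    while stack:
        server = stack.pop()
        if server not in seen:
            seen.add(server)
            stack.extend(neighbours[server])
    return len(seen) == len(neighbours)


print(is_ring(RING_CABLES))      # True
print(is_ring(RING_CABLES[:2]))  # False: without S3 -> S1 it is only a chain
```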

Once the ISO was mounted, I used “install -vvv --with-nvmf --force” to install everything properly for Ubuntu 20.04. I’m including all this extra info because I had such a hard time collating everything I needed, so hopefully this helps someone else.