Hey Everyone!
I’m facing a strange issue here that I can’t seem to solve :(
Long story short, I have around 20 servers to set up.
After installing the first 9, I realized that some of them are acting up.
The weird thing is that with RDMA set up, most of them communicate fine (send and receive), but some just won’t send or receive…
I have 2 types of servers, and the second type is the more troublesome one.
The install is identical: they all run the same Ubuntu 20.04 setup. The first 6 are identical in terms of hardware components, and the last 3 are identical to each other as well.
However, all 9 have the same NIC: “ConnectX-5 Ex 40GbE Dual-Port QSFP28”.
Here’s an overview I made of which ones can and cannot reach each other (using rping).
So the MA server can connect to W1, W2, … and vice versa.
Same for the W* ones.
Then the weirdness starts.
C1, C2 and C3 can rping to MA, W1, W3 and W4, but NOT to W2.
Also, C1 is able to rping C3, but not the other way around.
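In case it helps to see the pattern at a glance, here is a minimal Python sketch that tabulates the results listed above and picks out the failing and one-way links. The function names (`failed_links`, `one_way_links`) are made up for illustration, each key is assumed to mean (rping client, rping server), and only the pairs stated in this post are filled in; anything not mentioned is simply left out rather than guessed.

```python
from itertools import product

def failed_links(results):
    """Return every (client, server) pair whose rping attempt failed."""
    return sorted(pair for pair, ok in results.items() if not ok)

def one_way_links(results):
    """Return pairs that work in one direction while the reverse is known to fail."""
    return sorted(
        (a, b)
        for (a, b), ok in results.items()
        if ok and results.get((b, a)) is False
    )

# Results as described in the post; key = (client, server), value = rping succeeded.
results = {}
for src, dst in product(["C1", "C2", "C3"], ["MA", "W1", "W3", "W4"]):
    results[(src, dst)] = True
for src in ["C1", "C2", "C3"]:
    results[(src, "W2")] = False
results[("C1", "C3")] = True   # C1 can rping C3
results[("C3", "C1")] = False  # assumed meaning of "not the other way around"

print("failed:", failed_links(results))
print("one-way:", one_way_links(results))
```

Filling in the remaining pairs (for example W2 to C1, or C1 to C2) would show whether the W2 problem is one-directional as well.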
I hope someone can point me in the right direction for troubleshooting, because I’m out of ideas.
I already tried replacing the transceiver modules and swapping cables around…
Nothing worked.
Edit: iperf3 works (both ways) and regular ping works. I just installed the latest driver and nothing changed…
Any advice would be extremely appreciated!!
Thank you in advance!