SL4540 GEN8 + MT27500 Family [ConnectX-3]  ---> random packets "disappearing"


We've just built 9 RHES 6.4 servers (3 x SL4540 blades) where we are running some network-intensive software.

The thing is that, every so often, connectivity between these servers fails (for example, when making an HTTP request or an ssh attempt).

More specific details:

If we ssh from server A to server B, the connection is normally established OK.

But in roughly 1 out of every 10 ssh attempts, the connection request never reaches its destination.

At first we thought we were having network issues, but the switch has a plain configuration with no firewalls or special commands (apart from STP, which shouldn't interfere here).

To try to recreate the scenario, we ran a simple test with a small ssh script:

for (( ; ; )); do ssh serverB "ls -l" ; done

This executes ls on the remote server in an endless loop (serverB stands for the destination host).

Randomly, about once every 20-30 attempts, one of them fails (the session hangs).
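
Since a hung session stalls the endless loop above, a bounded variant that kills stuck attempts and counts the failures may make the failure rate easier to measure. This is only a sketch: the function name probe_loop and the host name serverB are placeholders, and it assumes the coreutils timeout command is installed.

```shell
# Run a command repeatedly and report how many attempts failed or timed out.
# Usage: probe_loop ATTEMPTS TIMEOUT_SECONDS COMMAND [ARGS...]
# (probe_loop is a hypothetical helper name, not part of any package)
probe_loop() {
    local attempts="$1" limit="$2"
    shift 2
    local failures=0 i=1
    while [ "$i" -le "$attempts" ]; do
        # timeout(1) kills a hung attempt so the loop keeps going
        if ! timeout "$limit" "$@" >/dev/null 2>&1; then
            failures=$(( failures + 1 ))
            echo "$(date '+%F %T') attempt $i failed or timed out" >&2
        fi
        i=$(( i + 1 ))
    done
    echo "$failures/$attempts failed"
}

# Example (serverB is a placeholder host name):
# probe_loop 200 10 ssh -o BatchMode=yes -o ConnectTimeout=5 serverB "ls -l"
```

The timestamps printed for each failed attempt can then be lined up against switch logs or packet captures taken at the same time.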

We ran the same kind of test with http / scp / smtp, and we get the same error (random timeouts caused by a missing packet).

We also ran tcpdump, and the packet that fails to reach server B really is disappearing (it is sent by server A but never received by server B).

Kernel stacks and sockstat look fine (no orphaned sockets and no tables filling up).

We have 9 servers.

3 out of the 9 servers are showing this problem. The rest are working fine.

All of them have exactly the same configuration, same kernel parameters and same OS + software.

We are using bonding in mode 1 (active-backup).
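
With active-backup bonding, one thing that might be worth ruling out is the bond failing over between slaves. The sketch below summarises the bonding status file; the helper name bond_summary and the bond name bond0 are assumptions, so adjust them to the actual setup.

```shell
# Summarise a bonding status file: active slave plus per-slave link failures.
# Usage: bond_summary [FILE]   (defaults to /proc/net/bonding/bond0)
# bond_summary is a hypothetical helper; bond0 is an assumed bond name.
bond_summary() {
    local file="${1:-/proc/net/bonding/bond0}"
    awk -F': ' '
        /^Currently Active Slave/ { print "active=" $2 }
        /^Slave Interface/        { slave = $2 }
        /^Link Failure Count/     { print slave " link_failures=" $2 }
    ' "$file"
}

# Example: bond_summary /proc/net/bonding/bond0
```

A link failure count that keeps increasing on the affected servers would point towards link flaps rather than the driver.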

We are using these drivers:

mlx4_core: Mellanox ConnectX core driver v1.1 (Dec, 2011)

mlx4_en: Mellanox ConnectX HCA Ethernet driver v2.0 (Dec 2011)

We did not want to upgrade the driver/firmware, since the problem only shows up randomly on 3 servers out of 9.

Has this issue been reported before?

Can you recommend any extra/special parameters or configuration to test and rule out hardware/link problems?
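
To confirm that all nine servers really run the same driver and firmware, the output of ethtool -i can be reduced to a single line per host and compared. A minimal sketch, where driver_summary is a hypothetical helper and eth0 is a placeholder interface name:

```shell
# Reduce `ethtool -i` output (read from stdin) to one comparable line.
# driver_summary is a hypothetical helper name.
driver_summary() {
    awk -F': ' '
        $1 == "driver"           { d = $2 }
        $1 == "version"          { v = $2 }
        $1 == "firmware-version" { f = $2 }
        END { print d " " v " fw=" f }
    '
}

# Example (eth0 is a placeholder interface name):
# ethtool -i eth0 | driver_summary
```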



How do you recover after you get into this state? Does it recover by itself?

“Lost” packets should leave some clues somewhere: do you see Rx/Tx errors or discards on hosts (with ifconfig for example) or on the switch ports you are using?
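
Besides ifconfig, the same counters can be read from sysfs. Here is a rough sketch that prints only the non-zero error/drop counters for one interface; nic_errors is a made-up helper name and the interface name in the example is a placeholder.

```shell
# Print non-zero error/drop counters for one interface.
# Usage: nic_errors IFACE [SYSFS_NET_DIR]
# nic_errors is a hypothetical helper; the second argument exists only
# so the function can be pointed at a copy of the sysfs tree.
nic_errors() {
    local iface="$1" base="${2:-/sys/class/net}"
    local name file value
    for name in rx_errors tx_errors rx_dropped tx_dropped \
                rx_missed_errors rx_crc_errors; do
        file="$base/$iface/statistics/$name"
        [ -r "$file" ] || continue
        value=$(cat "$file")
        if [ "$value" -ne 0 ]; then
            echo "$iface $name=$value"
        fi
    done
}

# Example (eth0 is a placeholder interface name):
# nic_errors eth0
```

Running it before and after a failed connection attempt shows whether any counter moved. Driver-specific counters are also available from ethtool -S.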

Another thing to check is whether you still have the IP address assigned and the ports are up on both ends when that happens: NetworkManager is known to reset a manually set IP configuration if it is enabled with DHCP on the interface, so it would suddenly drop the manually configured IP address.
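
To catch an address being dropped, the assigned addresses can be logged while the test runs. A small sketch, assuming the iproute2 ip command is available; list_ipv4 is a hypothetical helper and bond0 is an assumed interface name.

```shell
# Read `ip -o -4 addr show` output on stdin; print "IFACE ADDRESS" per address.
# list_ipv4 is a hypothetical helper name.
list_ipv4() {
    awk '$3 == "inet" { print $2, $4 }'
}

# Example: log the address every 5 seconds (bond0 is an assumed name):
# while sleep 5; do
#     echo "$(date '+%F %T') $(ip -o -4 addr show dev bond0 | list_ipv4)"
# done
```

A log line with a timestamp but no address would show the moment the configuration was lost.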

Hi all,

Just an update.

After updating our driver version to v2_2-1_0_1 the issue seems to have stopped.

We are running some more tests to rule out any other issue, but so far so good: upgrading the driver has resolved the issue.