We've just built 9 RHEL 6.4 servers (3xSL4540 blades) on which we are running some network-intensive software.
The problem is that connectivity between these servers occasionally fails (for example, when making an HTTP request or an SSH attempt).
More specific details:
If we ssh from server A to server B, the connection is normally established OK.
But in roughly 1 out of every 10 ssh attempts, the connection request never reaches its destination.
At first we thought we had a network issue, but the switch has a straightforward configuration, with no firewalls or special commands (except for STP, which shouldn't interfere here).
To recreate the scenario, we ran a simple test with a small SSH script:
for (( ; ; )); do ssh hostname.xxx "ls -l" ; done
This runs ls in an endless loop.
Randomly, about once every 20-30 attempts, one of them fails (the session hangs).
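A variant of this loop that does not hang on a lost connection and counts the failures could look like the sketch below. The function name probe_loop, the attempt count and the 5-second timeout are illustrative assumptions, not what was run originally.

```shell
# Run a command a fixed number of times and count how often it fails.
# Usage: probe_loop ATTEMPTS CMD [ARGS...]
probe_loop() {
    local attempts="$1"
    shift
    local failed=0
    local i=1
    while [ "$i" -le "$attempts" ]; do
        # Any non-zero exit status (including an ssh connect timeout) counts as a failure.
        if ! "$@" >/dev/null 2>&1; then
            failed=$((failed + 1))
            echo "$(date '+%F %T') attempt $i failed" >&2
        fi
        i=$((i + 1))
    done
    echo "$failed/$attempts failed"
}

# Example (hostname.xxx as in the loop above; ConnectTimeout makes a lost
# connection request fail after 5 seconds instead of hanging the session):
# probe_loop 200 ssh -o BatchMode=yes -o ConnectTimeout=5 hostname.xxx "ls -l"
```

The timestamps printed for failed attempts make it easier to match each failure against a packet capture taken at the same time.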
We ran the same test with HTTP, scp and SMTP, and we get the same error (random timeouts caused by missing packets).
We also ran tcpdump, and the packet that fails to reach server B actually disappears (it is sent, but it is never received by server B).
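One way to check whether the receiving NIC is discarding the packet would be to snapshot its receive counters before and after a test run. This is only a sketch: the function name rx_counters and the interface name eth0 are assumptions, and not every driver exposes every counter, so missing files are skipped.

```shell
# Print receive-side error and drop counters for one interface.
# Usage: rx_counters IFACE [SYSFS_ROOT]   (SYSFS_ROOT defaults to /sys/class/net)
rx_counters() {
    local iface="$1"
    local root="${2:-/sys/class/net}"
    local name
    for name in rx_packets rx_errors rx_dropped rx_missed_errors rx_crc_errors; do
        # Skip counters that are not present or not readable.
        if [ -r "$root/$iface/statistics/$name" ]; then
            printf '%s %s %s\n' "$iface" "$name" "$(cat "$root/$iface/statistics/$name")"
        fi
    done
}

# Example on server B (eth0 is an assumed slave interface name):
# rx_counters eth0 > /tmp/before.txt
# ... run the ssh loop from server A ...
# rx_counters eth0 > /tmp/after.txt
# diff /tmp/before.txt /tmp/after.txt
```

If the drop or error counters grow during a run that contains a failed attempt, that would point at the NIC or link on server B rather than at the switch.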
Kernel stacks and sockstat look fine (no orphaned sockets and no tables filling up).
We have 9 servers.
3 of them are showing this problem. The rest are working fine.
All of them have exactly the same configuration, same kernel parameters and same OS + software.
We are using bonding in mode 1 (active-backup).
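Since bonding is involved, the bond status file can be compared between a good server and a failing one. The sketch below summarises the active slave and the link failure count of each slave. The function name bond_summary and the bond name bond0 are assumptions.

```shell
# Summarise a bonding status file: active slave and per-slave link failures.
# Usage: bond_summary [FILE]   (FILE defaults to /proc/net/bonding/bond0)
bond_summary() {
    awk -F': ' '
        /^Currently Active Slave/ { print "active " $2 }
        /^Slave Interface/        { slave = $2 }
        /^Link Failure Count/     { print slave " link_failures " $2 }
    ' "${1:-/proc/net/bonding/bond0}"
}

# Example: run on each of the 9 servers and compare the output.
# bond_summary
```

A non-zero or growing link failure count, or an active slave that changes between runs, would suggest a flapping link on that server.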
We are using these drivers:
mlx4_core: Mellanox ConnectX core driver v1.1 (Dec, 2011)
mlx4_en: Mellanox ConnectX HCA Ethernet driver v2.0 (Dec 2011)
We did not want to upgrade the driver/firmware, since the problem only appears randomly on 3 of the 9 servers.
Has this issue been reported before?
Can you recommend any extra/special parameters or configuration to test and rule out hardware/link problems?