ConnectX-3 with Cisco ACI

Hello Mellanox people

We have built an OpenStack implementation on a Cisco ACI network: Cisco 9Ks in a spine/leaf topology. All of the cloud nodes are configured with dual-port ConnectX-3 NICs at 40 Gb. The two ports are an LACP bond to separate leaf switches, cabled with twinax. We plan on growing this particular cloud's compute base by more than 400%, meaning we will add another 100+ compute nodes, more 9Ks, and more Mellanox NICs. We are concerned about that, because we have so many problems with what we have right now.
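For context, each node's bond is the standard RHEL 802.3ad setup, roughly along these lines (the interface names and option values below are illustrative, not copied from our hosts):

# /etc/sysconfig/network-scripts/ifcfg-bond0 (illustrative)
DEVICE=bond0
TYPE=Bond
BONDING_MASTER=yes
BONDING_OPTS="mode=802.3ad miimon=100 lacp_rate=fast xmit_hash_policy=layer3+4"
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-<port> (one file per ConnectX-3 port)
DEVICE=<port>
TYPE=Ethernet
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none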

We are having huge problems, to the point where many applications have stopped working: they either crash or time out waiting on I/O. Storage for our cloud is over the network, in the form of a Ceph cluster, so network latency matters a lot.

Host configuration:

NIC: Mellanox Ethernet Controller MT27500 - ConnectX-3 Dual Port 40GbE QSFP+ - Device 0079

OS: Red Hat Enterprise Linux 7.2, kernel 3.10.0-514 (which is from the 7.3 tree)

Mellanox driver: stock mlx4_en 2.2-1 from the Red Hat kernel package

/etc/modprobe.d/mlx4.conf: left as-is at first, then tried:

options mlx4_en pfctx=3 pfcrx=3
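(For reference, the parameter descriptions can be pulled straight from the module itself, which is one way to sanity-check what pfctx/pfcrx are supposed to control on this driver build, e.g.:)

# List the mlx4_en module parameters and their descriptions;
# pfctx/pfcrx should show up as per-priority PFC policy settings.
modinfo -p mlx4_en | grep -i pfc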

Symptoms:

Network latency, and possibly even packet loss. I can't prove the loss yet, but I believe packets are disappearing, and it causes outages and outright failures of client applications and services.

Ceph (our storage cluster) has huge problems: random 5-10 second outages, which I believe are packet loss. Red Hat says our Ceph problems are caused by network problems and won't support it until that is fixed… so they agree with me!

Cloud hosts drop millions of packets, and the drop rate is directly proportional to the data rate.

Cloud hosts send a lot of pause frames, even at low data rates: 50-100/s. Is this normal?

Cloud hosts receive no pause frames.

Our network support people say they are seeing a ton of pause frames and a large number of buffer drops on the switch uplinks.
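For anyone who wants specific counters: this is roughly what I have been looking at on the hosts (ethX stands in for the two ConnectX-3 ports in the bond):

# Negotiated/configured flow control on each port
ethtool -a ethX
# Pause frame and drop counters as reported by the mlx4_en driver
ethtool -S ethX | grep -Ei 'pause|drop|discard'
# Kernel-level RX/TX drop counters for the slaves and the bond
ip -s link show ethX
ip -s link show bond0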

Question:

Does anyone have any experience with Mellanox cards in a Cisco ACI environment?

  1. Flow control. My understanding is that the ConnectX-3's default flow control is LLFC, i.e. 802.3x port-based pause; please correct me if I'm wrong. Cisco ACI only supports 802.1Qbb priority-based flow control (PFC). Would incompatible flow control, i.e. mismatched layer-2 congestion management, manifest itself as what we are seeing?

I tried to enable priority-based flow control via the driver configuration, but I couldn't tell whether it was enabled or not. The cards send out pause frames, but I can't tell what type. The current settings are shown below:

cat /sys/module/mlx4_en/parameters/pfctx

0

cat /sys/module/mlx4_en/parameters/pfcrx

0
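Since both parameters read back as 0 even with the modprobe.d entry in place, I suspect the option never actually took effect: as far as I know the parameters are only read when mlx4_en is loaded, so the change needs a driver reload (and an initramfs rebuild if the module is loaded from there) to stick. This is my rough plan for re-applying and verifying it, assuming my reading of the bit mask is right (3 = priorities 0 and 1):

# Rebuild the initramfs so the modprobe.d change is also picked up at boot
dracut -f
# Reload the driver on a drained host -- this takes the ports down
modprobe -r mlx4_en && modprobe mlx4_en
# Confirm the values actually changed
cat /sys/module/mlx4_en/parameters/pfctx
cat /sys/module/mlx4_en/parameters/pfcrx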

My theory is that we have a flow control problem: the switch is confused, and so are the cards. My only concern with that theory is that even at low data rates the cards send out 50-100 pause frames a second, and at higher rates, hundreds a second. Is that normal?

Under load, say 5-7 Gb/s bursts of traffic, we see 100-200 dropped packets a second. A single system can accumulate 100 million dropped packets on the bond and/or the physical NICs.
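One thing I still want to rule out on the host side is plain RX ring exhaustion, as opposed to (or on top of) a flow control mismatch. Roughly, again with ethX standing in for the physical ports:

# See which counters the drops actually land in
ethtool -S ethX | grep -Ei 'drop|discard|error'
# Compare current ring sizes against the hardware maximum
ethtool -g ethX
# Bump the RX ring toward the maximum reported above (8192 is a guess; use whatever -g reports)
ethtool -G ethX rx 8192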

If it's not a flow control problem, what else could be the root cause? Sorry, this is a big question with a lot of factors.

Any help would be hugely appreciated. We will be growing this environment, but we don't want to commit to these specific switches/NICs until we can get this working.

Cheers

Rocke