driver porting issue: RX eventually stops

I have the linux 1.5.9 driver ported: mlx4_core and mlx4_en, for the ConnectX3-EN product.

I am running a stress test using multiple ttcp client/server pairs to completely load RX and TX on both ports.

This runs happily for approx two days, and then RX stops.

I can run tcpdump on the far side and I see ARP requests coming in and being responded to.

The ARP responses are never received.

If I do something which causes en_stop_port/en_start_port to happen (ifconfig down up works) then it is off happily running again.

I am running a watchdog, and I know I am goosing the RX receive for all rings periodically.

I have enabled a debug mode where I can dump the full ring and CQE and prod/cons values.

(kernel debugger is fragile and pretty much useless unless I fault or panic).

Anyone know what I might look at? I dump the raw eth_stats hoping for some errors: none.

I found a diagnostic report function, and I dump that, and it returns all zeros.

At one point, when the ARP responses were not being received, I manually added the MAC addresses to the ARP table and some things started working again. I am beginning to believe that it is one or more rings which go bad, and when the RX ring which gets all the ARP responses clogs up, the rest of the traffic necessarily ceases flowing.

If anyone has any experience with low level debug of the adapter state, in particular regarding anything which might shed light on why one queue out of 4 stops completely, please let me know.

figured it out finally.

There is a bug in the linux driver code. I have no idea how it manages to work under linux, but there is some code in the RX path which breaks for me and causes the RX ring to be depleted in some fashion, right around the 32-bit wrapping.

I note that this area changed in 1.5.10. The code in question was deleted.
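
The post does not say which code was at fault, so the following is only a sketch of the general class of bug, assuming free-running 32-bit producer/consumer counters on the RX ring. The helper names (ring_used, ring_has_room, ring_has_room_naive) are made up for illustration and are not taken from the mlx4 driver. A refill check that compares the raw counters gives the wrong answer in the window where cons + size has wrapped past 0xffffffff but prod has not, and a driver that then skips the refill ends up with a depleted ring. Unsigned subtraction gives the right answer on both sides of the wrap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers, not taken from the mlx4 driver. */

/* Number of posted-but-unconsumed buffers. Unsigned subtraction is
 * computed modulo 2^32, so the result stays correct when prod has
 * wrapped and cons has not. */
static uint32_t ring_used(uint32_t prod, uint32_t cons)
{
    return (uint32_t)(prod - cons);
}

/* Wrap-safe: true while fewer than 'size' buffers are outstanding. */
static bool ring_has_room(uint32_t prod, uint32_t cons, uint32_t size)
{
    assert(size != 0);
    return ring_used(prod, cons) < size;
}

/* Broken variant: (cons + size) wraps to a small number shortly before
 * cons itself wraps, so the comparison reports "no room" on a nearly
 * empty ring and the refill never happens. */
static bool ring_has_room_naive(uint32_t prod, uint32_t cons, uint32_t size)
{
    return prod < (uint32_t)(cons + size);
}
```

For example, with size 0x400, cons 0xfffffc01 and prod 0xfffffc05 there are only 4 buffers outstanding. ring_has_room() says there is room, while ring_has_room_naive() says there is none, because cons + size has wrapped to 1.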


Are you using the two ports of the NIC or just one?

What is the OS version and kernel you are using?

Did you try the newer EN driver? (1.5.10 or the latest MOFED package).

I am using OS X version 10.8.2.

It doesn’t matter if I use one or both ports. It just takes longer to hit it with two ports.

I have eyeballed the differences between 1.5.9 and 1.5.10 and see nothing which would affect this.

In my most recent trial, I reduced the number of RX rings to 1. It has stopped. Here is some debug info I (now) have showing the ring params:

ring 0/1 cq 0xffffff80da688000 cqn 8c cons 3ff prod 3ff bytes 2335f59df0e4 pkts ffffffff

Wow. That is nice. The pkts counter is just software used for adjusting moderation, so the fact that it is about to wrap is incidental.

The last-consumed slot at 3fe shows:

cqe[3fe] owner_sr_op 81 vlan_my_qpn 105 status 1440 byte_cnt 233a checksum ffff

and the next one to be consumed:

cqe[3ff] owner_sr_op 1 vlan_my_qpn 105 status 1440 byte_cnt 233a checksum ffff


cqe[0] owner_sr_op 81 vlan_my_qpn 105 status 1440 byte_cnt 233a checksum ffff

So I will indefinitely examine that location waiting for the owner_sr_op bit to change.
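
To make the stall concrete, here is a minimal sketch of the ownership test, modelled on what the upstream Linux mlx4_en RX path appears to do: software owns a CQE when bit 7 of owner_sr_opcode (MLX4_CQE_OWNER_MASK, 0x80, upstream) matches the pass-parity bit, cons_index & size. The helper name cqe_is_sw_owned and the local CQE_OWNER_MASK define are made up for illustration. The sketch also assumes the "cons 3ff" in the dump is the index already masked to a 0x400-entry ring, and that the unmasked counter is on an odd pass, which is what the 0x81 in slots 3fe and 0 suggests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 7 of owner_sr_opcode (MLX4_CQE_OWNER_MASK in the upstream driver). */
#define CQE_OWNER_MASK 0x80u

/*
 * Hypothetical helper. The hardware flips the owner bit it writes on
 * every pass through the ring, so a CQE belongs to software only when
 * its owner bit matches the parity of the current pass, (cons_index & size).
 * cons_index is the free-running (unmasked) consumer counter; size is
 * the number of CQEs and must be a power of two.
 */
static bool cqe_is_sw_owned(uint8_t owner_sr_opcode, uint32_t cons_index,
                            uint32_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);

    bool owner_bit = (owner_sr_opcode & CQE_OWNER_MASK) != 0;
    bool pass_bit  = (cons_index & size) != 0;

    return owner_bit == pass_bit;
}
```

Under those assumptions, with size 0x400 and an unmasked cons_index of 0x7fe, slot 3fe (0x81) passes the test. At 0x7ff, slot 3ff (0x01) does not, so a poll loop built on this test stops there until the hardware writes a new CQE with the owner bit set. Slot 0 (0x81) would only become valid for the next pass once its owner bit is cleared again.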

If I look at the stats returned from DUMP_ETH_STATS I see that the RX count still advances. Something is receiving the packets, and dropping them. I get no errors in any of the stats counters I have examined.

If I do a "ping -f -b " from the other side, I see the RX counters for BCAST frames received from DUMP_ETH_STATS increasing nicely.

If it was just a matter of missing an interrupt, the polling would fix things. However, the fact that I can repeatedly examine the next-to-be-consumed cqe and it is not being updated has me perplexed.

Are there some credits or something which need to be replenished?