What is causing rx_discards_phy to occasionally increase?

I’ve written an application that receives UDP multicast data from the network, copies it to a GPU, copies results back and transmits them over the network (again with UDP multicast). The application uses raw ethernet verbs with multi-packet receive queues. Sometimes, maybe once every few minutes, the rx_discards_phy counter will jump, usually by about 1-50 packets. I’m struggling to figure out what could be causing it. The bandwidth is fairly high (about 68 Gb/s in, 54 Gb/s out), but the packets are large (5 KB+) so the packet rate isn’t particularly high.

Is there some way I can nail down what is causing this, and ideally prevent it? For example, are there extra counters somewhere that will indicate whether this is due to PCIe back-pressure versus some internal bottleneck?
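
So far I’m just polling the ethtool counters in a loop to catch the jumps, something along these lines (the interface name is a placeholder):

  # watch the mlx5 ethtool counters once a second and highlight anything that changes
  watch -n 1 -d 'ethtool -S <interface> | grep -E "discard|buffer"'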

Here are things I’ve already tried or steps I’ve taken to tune the system:

  • Disabled local loopback via /sys/class/net/<interface>/settings/force_local_lb_disable.
  • Pinned the processes to the CPU socket containing the NIC and GPU, and pinned a separate core for the thread servicing each QP.
  • Set the BIOS to Performance profile (Dell R740) and passed intel_pstate=disable intel_idle.max_cstate=1 pcie_aspm=off to the kernel.
  • Used huge pages for the memory regions.
  • Tried both event-driven and polling approaches to servicing the QPs.
  • Disabled the GPU host<->device transfers to reduce system DRAM usage.
  • Increased the receive ring size with ethtool -G <interface> rx 8192 (was 1024).
  • Increased the PCIe max read request size from 512 to 4096 bytes (verification commands are just after this list).
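
In case it matters, this is roughly how I verify that the last few settings actually took effect (the interface name and PCIe address are placeholders for my system):

  # check that the NIC sits on the same NUMA node as the pinned cores
  cat /sys/class/net/<interface>/device/numa_node
  # confirm the receive ring resize took effect
  ethtool -g <interface>
  # confirm the PCIe max read request size (look for MaxReadReq under DevCtl)
  lspci -s <pci_address> -vv | grep MaxReadReq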

The InfiniBand out_of_buffer counter doesn’t increase, so I believe I’m making WQEs available quickly enough. The output_pci_stalled_* counters don’t increase either. For some reason I don’t have the outbound_pci_buffer_overflow counter, even though the machine is running Linux 5.0 and the documentation (https://community.mellanox.com/s/article/understanding-mlx5-ethtool-counters) says it’s available from 4.14.
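
For completeness, I’m reading out_of_buffer from the rdma hw_counters in sysfs, roughly like this (the device name is a placeholder for whatever mlx5 device the port is on):

  # out_of_buffer counts packets dropped because no receive WQE was available
  cat /sys/class/infiniband/<device>/ports/1/hw_counters/out_of_buffer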

I’m running MLNX OFED 4.7-1.0.0 and the firmware is 16.25.6000.

Thanks for any help

Bruce

I’ve simplified the test case a lot, including eliminating the GPU from the picture and replacing most of the networking code with raw_ethernet_bw. I can’t be 100% sure that I haven’t changed the source of the problem along the way, but I’m seeing similar symptoms. I’m having a machine do three things:

  • Receive UDP multicast data at 68 Gb/s, 5KiB packets. I couldn’t figure out how to make raw_ethernet_bw’s client-server mode work, so I’m using mcdump as the receiver (https://spead2.readthedocs.io/en/latest/tools.html#mcdump) and raw_ethernet_bw (on another machine) as the sender.
  • Send data at 54 Gb/s, 8KiB packets, using raw_ethernet_bw. I was also able to produce packet drops using lower bandwidth and higher packet rates (small packets).
  • Thrash the memory system by having 4 threads run large memcpys in a loop, achieving about 24 GB/s each for read and write (48 GB/s total, out of a theoretical 107 GB/s for a Xeon Silver 4114).

This leads to about 1-100 packets being dropped every few seconds. Increasing the number of memcpy threads substantially increases the rate at which packets are dropped. The interesting thing is that when I turn off the transmit side I don’t get the dropped packets, even if I increase the number of memcpy threads to add another 12 GB/s of memory traffic.
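
My memory thrasher is a hand-rolled program, but something like stress-ng’s memcpy stressor pinned to the same socket should be a reasonable stand-in if anyone wants to reproduce this (the core list is specific to my box, so treat it as a placeholder):

  # 4 threads doing large memcpys in a loop, pinned to the socket that hosts the NIC
  stress-ng --memcpy 4 --taskset 0,2,4,6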

Is it possible that the tx, bogged down by PCIe latency caused by the memcpys, is hogging the PCIe bus and preventing rx from writing the data it received? And if so, is there anything that can be done to improve it?

Let me know if you’d like more information about the exact commands I’m running, or if it would help for me to file a support request where I can attach sysinfo about the machines involved.

Thanks in advance

Bruce

Check out neo-host as a tool for getting advanced diagnostics:

https://www.mellanox.com/page/products_dyn?product_family=278&mtag=mellanox_neo_host

Please also try the ‘raw_ethernet_bw’ application from the perftest package.

Exclude the GPU from the picture and see if the issue still reproduces with a simplified test case.

Try running the test with the same machine acting as both server and client.

Thanks, I wasn’t aware of neo-host; from a quick look at the manual it seems like it should give some insight. I’ll give it a try.

Neo-host certainly gives a lot more information, although without detailed knowledge of the NIC internals I’m not sure how to interpret it all. I still have more investigation to do, but I hope you don’t mind answering a few questions in the meantime:

  • When packets drop, the “RX buffer full port 1” count shows the buffer being full for a few hundred cycles. It surprises me a bit that when it overflows, it doesn’t overflow by much more. How big is this buffer, and is there a setting to increase it? (I’ve put the command I’d use to dump what I think is the relevant buffer configuration after these questions.)
  • It shows PCI reads being stalled by a lack of completion buffers about 50% of the time. Does that imply that the PCI latency is unusually high? Would it cause problems for network Rx, or is it purely a Tx function?
  • PCI latency seems to have occasional spikes: normally it’s 2-3µs, but sometimes the max latency goes as high as 15µs. I’m still investigating whether the spikes are correlated with the packet losses, but are they something I should worry about, or normal operation when there is heavy traffic?
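
On the first question: I’m assuming the buffer in question is the per-port receive packet buffer, in which case I think this is how I’d dump its current configuration (the interface name is a placeholder, and I’m not certain mlnx_qos is the right tool here):

  # if I understand correctly, this prints the per-priority receive buffer sizes for the port
  mlnx_qos -i <interface>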