I can’t answer the network side, but theoretically PCIe gen. 3 runs at 8 GT/s per lane (gigatransfers per second, a PCI-SIG designation). Call that 8 Gb/s per lane before the 128b/130b encoding overhead, so the theoretical maximum bandwidth for 8 lanes of PCIe gen. 3 is: 8 * (128/130) * 8 ≈ 63 Gb/s.
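The arithmetic above can be sketched as a small helper; the transfer rates and encoding ratios are PCI-SIG figures, while the function name is just my own choice:

```python
# Sketch: theoretical PCIe link bandwidth after encoding overhead.

def pcie_bandwidth_gbps(gt_per_s: float, enc_num: int, enc_den: int, lanes: int) -> float:
    """Raw transfer rate * encoding efficiency * lane count, in Gb/s."""
    return gt_per_s * (enc_num / enc_den) * lanes

# PCIe gen 3: 8 GT/s per lane, 128b/130b encoding, 8 lanes
print(round(pcie_bandwidth_gbps(8, 128, 130, 8), 1))  # ~63.0 Gb/s

# For comparison, gen 2 uses 5 GT/s with the heavier 8b/10b encoding:
print(round(pcie_bandwidth_gbps(5, 8, 10, 8), 1))     # ~32.0 Gb/s
```

Note this is the raw link ceiling; TLP packet headers and flow control eat a further slice before your data sees any of it.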
Then there is the encoding and clock rate of the network as well, plus the limitation of the memory controller. If everything is forced through a single CPU core, then that too is a limitation. If you happen to know the hardware IRQ used in a particular case, then you can look at “/proc/interrupts” and find out which core it runs on.
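A rough way to automate that check is to parse “/proc/interrupts” and pick the CPU column with the highest count for your device. This is a sketch assuming the usual layout (a CPU header row, then one row per IRQ); the device name “eth0” is only an example:

```python
# Sketch: find which CPU core is servicing a given interrupt by
# parsing the text of /proc/interrupts.

def busiest_cpu_for_irq(interrupts_text: str, irq_name: str) -> int:
    """Return the index of the CPU with the highest count for irq_name."""
    header, *rows = interrupts_text.strip().splitlines()
    n_cpus = len(header.split())          # header row is "CPU0 CPU1 ..."
    for row in rows:
        if irq_name in row:
            fields = row.split()
            counts = [int(c) for c in fields[1:1 + n_cpus]]
            return counts.index(max(counts))
    raise ValueError(f"{irq_name} not found")

# Typical use on a live system:
# with open("/proc/interrupts") as f:
#     print(busiest_cpu_for_irq(f.read(), "eth0"))
```

Run it before and during a heavy transfer; the column that jumps is the core doing the interrupt work.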
DMA might allow a lot more throughput than going through a CPU core, but eventually a CPU core needs to be used, and I suspect that is a huge limitation. If the hardware IRQ part can be handled entirely via DMA, and the data can be offloaded to another core (one different than the one handling the hardware IRQ), then your chances of getting better bandwidth go up. Even checksums will load things down, even for a pure data transfer where you throw the bytes away.
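If you do want to move the IRQ work off a busy core, Linux exposes this via “/proc/irq/&lt;N&gt;/smp_affinity”, which takes a hex CPU bitmask. A minimal sketch; the IRQ number 42 and the chosen core are assumptions for illustration, it needs root, and irqbalance may overwrite the setting:

```python
# Sketch: build the hex CPU bitmask that /proc/irq/<N>/smp_affinity expects.

def cpu_mask(cpus: list[int]) -> str:
    """One bit per allowed CPU, rendered as hex without the 0x prefix."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")

print(cpu_mask([2]))     # "4" -> pin to CPU2 only
print(cpu_mask([0, 1]))  # "3" -> allow CPU0 and CPU1

# On a live system (as root), hypothetical IRQ 42:
# with open("/proc/irq/42/smp_affinity", "w") as f:
#     f.write(cpu_mask([2]))
```

Pinning the IRQ to one core and your application threads to different cores is exactly the split described above.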
Note: Watch in htop (sudo apt-get install htop) and examine which CPU core goes up in usage during a heavy transfer over your PCIe. Also examine “/proc/interrupts” (it is a virtual file generated by the kernel, not a real file on disk) during a heavy transfer and see which hardware IRQ count climbs the most.
Also, a larger MTU (jumbo frames) reduces per-packet overhead and improves throughput, perhaps at the cost of latency if the packets actually being sent are small.
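To put a number on the jumbo-frame gain, here is a rough payload-efficiency calculation. It assumes plain Ethernet + IPv4 + TCP with no options, and ignores the preamble and inter-frame gap, so treat the percentages as approximate:

```python
# Sketch: fraction of each on-wire frame that is actual TCP payload.
# MTU covers the IP packet (IP + TCP headers = 40 bytes, no options);
# the wire adds the Ethernet header (14 bytes) and FCS (4 bytes).

def payload_efficiency(mtu: int) -> float:
    payload = mtu - 40        # TCP payload per packet
    frame = mtu + 14 + 4      # bytes actually on the wire
    return payload / frame

print(f"MTU 1500: {payload_efficiency(1500):.1%}")  # ~96.2%
print(f"MTU 9000: {payload_efficiency(9000):.1%}")  # ~99.4%
```

The efficiency gain looks modest, but the bigger win is that a 9000-byte MTU means roughly one sixth as many packets, and therefore far fewer interrupts and per-packet CPU work, for the same number of bytes.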