Problem with PCIe throughput on TX1


I have a question, because I am having problems with high-speed transfers.
I am transferring data from an FPGA to a TX1; the FPGA writes data in large chunks (4MB, 512B per TLP). After each chunk an MSI interrupt is generated and the next transfer request is sent immediately. The data are generated on the fly, so write requests are issued as fast as possible.
The fastest I can go is almost 300MB/s (293MB/s on average, with the driver at 2% CPU load); transferring 4GB of data takes no less than 14s. I had hoped to easily reach 700MB/s. Looking inside the FPGA with a logic analyser, I found that after the first bulk TLP memory write (512 bytes) the TX1 makes me wait an unreasonable number of cycles before the next write can be sent, and all subsequent writes are delayed from that point on. On an x86_64 motherboard, on the other hand, everything works fine.

I am using PCIe Gen1 x4

# R24 (release), REVISION: 1.0

I am mapping memory using

dma_page = dma_zalloc_coherent(dev, 4194304, &dma_addr, GFP_KERNEL | GFP_DMA32);

In the speed test I am not even checking data integrity (I verified correctness separately, with different code), and the transfer rate is still not as high as I would expect from PCIe.
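For context, the buffer setup in the driver looks roughly like the following probe-time sketch. Function and variable names here are placeholders, not the actual driver's; note that the 32-bit restriction is better expressed through the DMA mask than through the GFP flags, which only describe the allocation context:

```c
#include <linux/dma-mapping.h>
#include <linux/pci.h>

#define FPGA_BUF_SIZE (4 * 1024 * 1024)  /* one 4 MB chunk */

/* Hypothetical probe-time setup; 'pdev' is the FPGA's struct pci_dev. */
static void *fpga_buf;
static dma_addr_t fpga_dma_addr;

static int fpga_setup_dma(struct pci_dev *pdev)
{
	int ret;

	/* Restrict DMA addresses to 32 bits if the FPGA core cannot
	   generate 64-bit memory writes. */
	ret = dma_set_mask(&pdev->dev, DMA_BIT_MASK(32));
	if (ret)
		return ret;
	ret = dma_set_coherent_mask(&pdev->dev, DMA_BIT_MASK(32));
	if (ret)
		return ret;

	/* Coherent mapping: no per-transfer cache maintenance needed, but
	   on ARM this memory is typically uncached, so CPU accesses to the
	   buffer are slow. */
	fpga_buf = dma_zalloc_coherent(&pdev->dev, FPGA_BUF_SIZE,
				       &fpga_dma_addr, GFP_KERNEL);
	return fpga_buf ? 0 : -ENOMEM;
}
```

The uncached nature of coherent memory on ARM is one reason a design that has the CPU touch the buffer can look much slower on the TX1 than on x86, where coherent memory remains cacheable.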

Is there some special way of handling this kind of communication on this platform that would allow me to get close to the maximum PCIe speed? I have read that there is a problem with the copy_to_user kernel function, but I am not using it here.


I have found a small weakness in my previous design (or in the TX1, I am not sure which). The simple kernel driver sent a DMA request for 4MB of data to the FPGA, and the next request was only sent from the interrupt handler, so there was a delay between the end of one transfer and the next request, which turned out to be significant. Now the FPGA sends data continuously and raises an interrupt after each 4MB; I count 1024 interrupts and calculate the bandwidth from that.
The previous average was about 294MB/s; now it is about 384MB/s, but that is still a little below the rate I am looking for. On the x86 motherboard the same code runs at ~700MB/s with both the previous and this version of the driver (the difference is hard to measure). Increasing the DMA buffer to 128MB does not change anything.
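The measurement described above can be sketched as an MSI handler along these lines (names and the pr_info message are placeholders, not the actual driver's):

```c
#include <linux/interrupt.h>
#include <linux/ktime.h>

#define CHUNK_SIZE   (4 * 1024 * 1024)
#define CHUNK_COUNT  1024              /* 4 GB total per measurement */

static int chunk_irqs;
static ktime_t t_start;

/* Hypothetical MSI handler: the FPGA streams continuously, raising one
   interrupt per 4 MB chunk; no per-chunk request is issued from here,
   so the link never idles between chunks. */
static irqreturn_t fpga_msi_handler(int irq, void *data)
{
	if (chunk_irqs == 0)
		t_start = ktime_get();

	if (++chunk_irqs == CHUNK_COUNT) {
		s64 us = ktime_us_delta(ktime_get(), t_start);
		u64 mbs = ((u64)CHUNK_SIZE * CHUNK_COUNT) / us; /* bytes/us == MB/s */

		pr_info("fpga: ~%llu MB/s over %d chunks\n", mbs, CHUNK_COUNT);
		chunk_irqs = 0;
	}
	return IRQ_HANDLED;
}
```

Since the interrupt handler no longer sits on the critical path of the transfer, any remaining shortfall points at the platform rather than at request latency in the driver.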

I have found a fix (I hope) for my problem. After starting a stress test (apt-get install stress) on the TX1, the PCIe DMA transfer rate from the FPGA to TX1 memory jumps to a little below 700MB/s, which is about what I wanted to achieve; immediately after stopping the stress test, the PCIe transfer speed drops back to about 380MB/s.

For now I can live with this, but is there any configuration I can set to keep the bandwidth up when the CPU is not under heavy load? On x86 there is no measurable difference in PCIe speed with and without heavy load.

You may find this of interest (force CPU cores to max performance):

Thanks, maximizing the EMC (memory controller) clock solved my problem.
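For reference, pinning the clocks on the TX1 amounts to something like the following. The exact sysfs/debugfs paths and the 1600 MHz EMC rate are assumptions based on L4T R24; check them against the jetson_clocks.sh script shipped with your release before relying on them:

```shell
# Run as root. CPU: performance governor on all cores.
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"
done

# EMC (external memory controller): override to a fixed high rate so it
# does not downclock when the CPU is idle. 1600 MHz is the typical TX1
# maximum; verify against your release's jetson_clocks.sh.
echo 1600000000 > /sys/kernel/debug/clk/override.emc/clk_rate
echo 1 > /sys/kernel/debug/clk/override.emc/clk_state
```

With the EMC rate pinned, the memory controller no longer scales down under light CPU load, which is why PCIe DMA throughput stays near 700MB/s without running an artificial stress workload.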