Unexpected low performance of PCIe DMA to TX1


We have a working design in which a Xilinx FPGA DMAs camera frames to a TX1 over a x4 PCIe link set to 2.5 GT/s (Gen1). Frames are received in DMA-coherent memory by a custom kernel driver and forwarded on to user space.

However, we are seeing the data rate top out at around 300 MB/s. If we request a higher frame rate that results in a data stream beyond this rate, the FPGA experiences back pressure from the PCIe link, which backs up the camera stream and causes the stream to fail.

I have all the performance settings / clocks set to max, including the EMC clock. (When the EMC clock is set lower than max, the top achievable data rate is even less, so it is having an effect.)

According to all the specs, shouldn't the TX1 easily handle this rate? Even though the destination is coherent memory, the TX1's DDR bandwidth is rated at 25.6 GB/s.

This previous post mentions a similar problem:


But in our case the low performance is observed on the DMA write itself, not on the subsequent read to user space.

The issue in the other thread is different: there, memcpy is not very efficient because dma_alloc_coherent() marks the area as uncached, so performance drops when the CPU works on it for the memcpy. But in this scenario you are saying the FPGA-based endpoint's DMA is not able to dump data at higher rates (because of back pressure, etc.).
Can you give more information on the release you are using (e.g. 23.1 / 24.1 / 24.2)? Also,
Can you try not consuming the data in user space? That is, let DMA dump the data to memory and discard it. If we don't see any issue with this, then it is most likely the kernel-space-to-user-space data transfers that are putting on the back pressure.
There are no known issues with TX1’s PCIe not being able to handle incoming data from end point’s DMA.
Also, it would be great if you can give the sequence of events that are taking place here.

I think we have narrowed it down to the fact that the FPGA is detecting the PCIe link as Gen1, and the data rate exceeds Gen1 bandwidth. I thought I saw in TX1 documentation that Gen2 is supported? When we configure the FPGA PCIe core to Gen2, the TX1 console prints a stream of PCIe bus errors as soon as the Tegra PCI controller driver is loaded at boot.

For background, L4T is 24.2. The FPGA delivers two camera image streams over PCIe via DMA at high frame rates. A custom kernel driver on the TX1 receives an MSI interrupt when each frame's DMA completes. A user-space application reads the frames out of their DMA location, but I have that part disabled while investigating this. The delivery of each frame reaches ~550 MB/s, so two frames at once does push the limits of Gen1, and that's where we see throttling all the way back to the IP cores generating the pixel streams.

Can you paste those errors here? Are those AER errors? If yes, can you please check if you have ASPM enabled? If yes, please try disabling it (by appending ‘pcie_aspm=off’ to kernel command line) and check once?

I do have pcie_aspm=off. This is the error that streams in:

pcieport 0000:00:01.0: AER: Corrected error received: id=0010
pcieport 0000:00:01.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0008(Receiver ID)
pcieport 0000:00:01.0:   device [10de:0fae] error status/mask=00000001/00002000
pcieport 0000:00:01.0:    [ 0] Receiver Error         (First)

Also, with PCIe debugging enabled, dmesg reports a steady stream of tegra_pcie_isr(1182) and handle_sb_intr(1141) alongside the errors, so there is activity there.

I should add that this happens before any camera streams are started. This is during link initialization.

As I see, the errors are of type 'Physical Layer'. This points to bad electricals (signal integrity). If you have any interposer / converter cards connected in between, can you please remove them and check once?

Ok, it may be because there are two boards / connectors in between the TX1 and the FPGA. All are designed with proper PCIe traces, but there may be too many connector transitions for Gen2. There is no way to remove them currently…

Is there a way to put the TX1 PCIe in internal loopback mode (PMA/PCS loopback) in order to test part of the path?

>> Is there a way to put the TX1 PCIe in internal loopback mode (PMA/PCS loopback) in order to test part of the path?
This terminology seems specific to Ethernet. Do you mean FarEnd loopback in PCIe?

Hi readonly,

Have you found the cause and resolved this problem?
Any status update for this issue?