AGX Xavier PCIe - real read performance.

We plan to interface the Xavier to our own FPGA board using PCIe Gen 2 with a x4 lane width (we are stuck with that).
It seems logical to have the Xavier as the root and our board as the endpoint; however, the data movement is FROM the FPGA board TO the Xavier. We need to sustain 12.8 Gbits/s. Ultimately, I need to know whether it can actually be done.

As I understand it, IF the FPGA were initiating the transfers, then with the overheads of 8b/10b encoding, packet headers, etc., the WRITE performance would be around 13.5 Gbits/s, which is sailing a bit close to the wind.
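For reference, here is a rough sketch of how a figure in that region can be reached. It assumes a 16-byte (4DW) memory-write header and 8 bytes of per-packet framing (start/end symbols, sequence number, LCRC), and it ignores ACK and flow-control DLLPs, so the real number will be somewhat lower. The function name and default values are my own illustration, not anything taken from Xavier documentation.

```python
def posted_write_throughput_gbps(payload_bytes, lanes=4, gt_per_s=5.0,
                                 header_bytes=16, framing_bytes=8):
    """Estimate payload throughput for back-to-back memory-write TLPs."""
    raw_gbps = lanes * gt_per_s            # Gen 2 runs at 5 GT/s per lane
    link_gbps = raw_gbps * 8 / 10          # 8b/10b: 8 data bits per 10 line bits
    # Fraction of each packet that is payload (assumed overhead, see above)
    efficiency = payload_bytes / (payload_bytes + header_bytes + framing_bytes)
    return link_gbps * efficiency


for mps in (128, 256):
    print(f"MPS {mps} bytes: {posted_write_throughput_gbps(mps):.2f} Gbits/s")
```

With these assumptions, a 128-byte payload gives about 13.5 Gbits/s and a 256-byte payload about 14.6 Gbits/s, before DLLP traffic is subtracted.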

However, if the Xavier (the requester) has to initiate each transfer by first sending a Read Request, and the DMA engine in the endpoint (the completer) then sends back data in chunks of 64 or 128 bytes according to the RCB, the overheads are much higher and make this a non-starter.
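To put numbers on that concern, here is the same kind of back-of-envelope estimate for read completions. It assumes every completion carries exactly the stated number of bytes, a 12-byte (3DW) completion header and 8 bytes of framing, and that the read requests themselves do not cost anything because they travel in the opposite direction on a full-duplex link. DLLP traffic is ignored again, and the function name is only illustrative.

```python
def completion_throughput_gbps(completion_bytes, lanes=4, gt_per_s=5.0,
                               header_bytes=12, framing_bytes=8):
    """Estimate payload throughput when data arrives as read completions."""
    link_gbps = lanes * gt_per_s * 8 / 10  # Gen 2 x4 after 8b/10b encoding
    packet_bytes = completion_bytes + header_bytes + framing_bytes
    return link_gbps * completion_bytes / packet_bytes


REQUIRED_GBPS = 12.8
for size in (64, 128, 256):
    rate = completion_throughput_gbps(size)
    verdict = "meets" if rate >= REQUIRED_GBPS else "misses"
    print(f"{size}-byte completions: {rate:.2f} Gbits/s ({verdict} {REQUIRED_GBPS})")
```

Under these assumptions, 64-byte completions fall short of 12.8 Gbits/s, while 128-byte completions clear it only narrowly.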

How DO we achieve the highest throughput given that we are stuck with Gen 2 and x4 width?
Can the endpoint initiate transfers to the Root?
What is the RCB (Read Completion Boundary) of the Xavier, and how can it be changed? It dominates the read (in)efficiency.
(I can see that the Maximum Payload Size is 256 bytes.)
Would it be better to have our board be the Root?

Thanks in advance for any help!

Does your FPGA have a built-in DMA engine whose configuration registers are exposed to the host through one of its BARs, so that the client driver (for your FPGA endpoint) running on the host can program the FPGA’s DMA and start a transfer? If yes, you can configure the system with the Xavier as the host and the FPGA as the endpoint, and have the endpoint’s DMA engine write data directly into the Xavier’s system memory. In this case, the perf should be around 13.5 Gbits/s.
If your FPGA doesn’t have a DMA engine and instead exposes its memory directly through a BAR, so that the host has to READ from the BAR to get data FROM the endpoint, then perf would be very bad, because the host CPU would be doing the transfers; this is not a perf use case.
Xavier supports a maximum payload of 256 bytes, so yes, the host can be configured for an MPS of 256.

Hello vidyas.

The FPGA (an Artix 7) can have a DMA engine; we are considering the Xilinx DMA Subsystem IP, and yes, configuration can be done from the host.

I’m encouraged that you agree with my calculated read bandwidth! I guess I was initially concerned with the concept of the root having to intervene for every payload. Once our data starts, it never stops; there’s just a single channel coming from an AXI-Stream port… in my ideal world I’d be transferring data in chunks of 64k!

But if the read requests can be pipelined and the physical interface can be kept saturated so we actually DO get 13.5 Gbits/s, I’ll be happy.
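As a rough way to think about how much pipelining that would take: the number of read requests that must be in flight is the target bandwidth multiplied by the round-trip latency, divided by the request size. The sketch below uses a 1 microsecond round trip purely as a placeholder; it is not a measured Xavier figure, and the function name is hypothetical.

```python
import math


def outstanding_reads_needed(target_gbps, round_trip_s, request_bytes):
    """Reads that must be in flight to hide the request/completion latency."""
    # Bytes that arrive during one round trip at the target rate
    bytes_in_flight = target_gbps * 1e9 / 8 * round_trip_s
    return math.ceil(bytes_in_flight / request_bytes)


ASSUMED_ROUND_TRIP_S = 1e-6  # placeholder latency, not a measured value
for request_bytes in (128, 256, 512):
    n = outstanding_reads_needed(12.8, ASSUMED_ROUND_TRIP_S, request_bytes)
    print(f"{request_bytes}-byte read requests: at least {n} outstanding")
```

If the requester cannot keep that many reads outstanding, the link idles between completions and the sustained rate drops below the packet-efficiency estimate.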

… Many thanks for responding!

Good to hear that. Do let us know if you run into any issues.