I have a Jetson TX1 Developer Kit, and I inserted an Altera development board (RaggedStone4) into the PCIe connector of the Jetson. I brought up a PCIe (Gen1, x4) link between the two boards, and I am using DMAs and the Altera PCIe Hard IP on the FPGA side for the communication.
I can reach ~700 MB/s when writing data from the FPGA to the TX1 over PCIe, but I only reach ~190 MB/s in the read direction. I made embedded logic analyzer measurements on the FPGA side and saw that the communication works in both directions with 64-word bursts, but in the read direction the read response time (the delay between the read request and read data valid) is approximately 1-2 us.
Do you have any suggestions about this slow communication issue?
Zsolt,
Most likely you have this going to DRAM in the FPGA.
For a write cycle you can push the data and address over and let the DRAM do the work without waiting. For a read you have to address the DRAM and wait some number of cycles before the data burst comes back.
If you add some local single-cycle SRAM blocks with wide data paths into your design that you can DMA to and from, the read and write speeds should compare better.
The DRAM in the TX1 must be quite fast to hit ~700 MB/s when it is read to write into the endpoint.
I am writing from the FPGA to the Tegra side, so the DRAM in the TX1 reaches 700 MB/s for data written from the endpoint.
My problem is the other direction, when I start a read request from the endpoint (FPGA) to the root port (Tegra). The speed of the DRAM in the FPGA cannot cause this issue, because I can see from the logic analyzer (on the FPGA side) that the read response time (the delay between Read and ReadDataValid) is too high, so the Tegra answers a read request slowly.
I hope this is some configuration issue on the Tegra side.
That's interesting if it's the TX1 side.
In the TX1 the DRAM is shared, but the ARM cores should be running code from cache so as not to hit the DRAM all the time; that is, PCIe should be getting most of the DRAM cycles. DRAMs take extra cycles when an access crosses a page, etc., so if other accesses happen, such as cache fills, from something other than the endpoint DMA controller, you most likely incur page-crossing penalties each time. This is true for both burst reads and writes. You don't have something like the CUDA cores hitting DRAM all the time, do you?
It really comes down to how things need to be pipelined for read cycles from DRAM to be fast. Unlike a write, in which you send both the burst data and the address, for a read cycle you must send the address and wait for the read data to be accessed and returned.
The way this gets fast is to have several outstanding read requests pipelined back to back into the DRAM controller, such that while it is executing the data burst of one read it is also taking in the next address to read. The PCIe root complex should be getting read requests for a large number of bytes from DRAM on each access from the endpoint DMA controller. This all needs to be pipelined to some extent, so that you may have several outstanding read requests from the endpoint sent over to the TX1 root complex, or big TLP requests that might get split into smaller returns of 32 or 64 bytes at a time.
Somewhere in all this the pipelining of read requests must be backing up; again, each hardware segment must be able to handle multiple outstanding read requests, which are tagged and sent back with the data burst when completed.
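To put rough numbers on that, here is a back-of-envelope model of read throughput versus the number of outstanding requests. The inputs are my assumptions, not your measurements: 64-bit words (so a 64-word burst is 512 bytes), roughly 1 GB/s of usable Gen1 x4 bandwidth, and the ~2 us request-to-ReadDataValid latency you saw on the logic analyzer.

```c
/* Rough model: effective PCIe read throughput vs. outstanding read
 * requests.  All numbers are illustrative assumptions: 64-bit words
 * (64-word burst = 512 bytes), ~1 GB/s usable Gen1 x4 bandwidth,
 * ~2 us request-to-ReadDataValid latency. */
#include <stdio.h>

int main(void)
{
    const double burst_bytes = 64.0 * 8.0;              /* assumed 64-bit words   */
    const double latency_us  = 2.0;                     /* read response latency  */
    const double link_mbps   = 1000.0;                  /* 1 MB/s == 1 byte/us    */
    const double wire_us     = burst_bytes / link_mbps; /* burst time on the wire */

    for (int outstanding = 1; outstanding <= 6; outstanding++) {
        /* Bandwidth-delay reasoning: n bursts in flight per round trip,
         * capped by the raw link rate. */
        double mbps = outstanding * burst_bytes / (latency_us + wire_us);
        if (mbps > link_mbps)
            mbps = link_mbps;
        printf("%d outstanding request(s): ~%4.0f MB/s\n", outstanding, mbps);
    }
    return 0;
}
```

With one request in flight this comes out around 200 MB/s, right in the neighborhood of the ~190 MB/s you measured; with four or five requests in flight the latency is hidden and the link rate becomes the limit.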
Again, I am assuming the endpoint memory writes from the DMA are not backing things up in this case once the data is getting back over the PCIe bus from the TX1.
The FPGA PCIe and DMA IP should be able to handle multiple outstanding read requests over PCIe. You might check whether there is a setting for this, or for how many bytes can be requested in a TLP read request to the root complex, in the DMA or PCIe endpoint IP (see the sketch below).
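On the Linux side, the maximum size of a read request the endpoint may issue is a standard Device Control field, and the kernel has helpers for it. A minimal sketch, assuming you have the endpoint's struct pci_dev in a driver (the 512-byte value is only an example):

```c
/* Sketch: check/raise the endpoint's Max_Read_Request_Size from a
 * Linux PCI driver.  'pdev' is the endpoint's struct pci_dev. */
#include <linux/pci.h>

static void check_read_request_size(struct pci_dev *pdev)
{
    int mrrs = pcie_get_readrq(pdev);  /* current Max_Read_Request_Size, bytes */
    int mps  = pcie_get_mps(pdev);     /* negotiated Max_Payload_Size, bytes   */

    dev_info(&pdev->dev, "MaxReadReq=%d bytes, MaxPayload=%d bytes\n", mrrs, mps);

    /* Larger read requests mean fewer request/completion round trips;
     * the root complex may still split completions into smaller pieces. */
    if (mrrs < 512)
        pcie_set_readrq(pdev, 512);
}
```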
There is a whole set of PCIe registers in the TX1 hardware guide that are set up by the host PCIe driver; there might be something there.
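Before digging into the Tegra-specific registers, it is worth dumping what was actually negotiated on the link from the standard PCIe capability. Another sketch, assuming you can get at the root port's (or endpoint's) struct pci_dev:

```c
/* Sketch: print the negotiated link speed/width and the Device Control
 * word from the standard PCI Express capability. */
#include <linux/pci.h>

static void dump_pcie_link(struct pci_dev *dev)
{
    u16 lnksta = 0, devctl = 0;

    pcie_capability_read_word(dev, PCI_EXP_LNKSTA, &lnksta);
    pcie_capability_read_word(dev, PCI_EXP_DEVCTL, &devctl);

    dev_info(&dev->dev, "link: gen%d x%d, devctl=0x%04x\n",
             lnksta & PCI_EXP_LNKSTA_CLS,                               /* speed */
             (lnksta & PCI_EXP_LNKSTA_NLW) >> PCI_EXP_LNKSTA_NLW_SHIFT, /* width */
             devctl);
}
```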
It does not solve the issue, but it might help in understanding where it is.
I am doing a few other things now; once I get things set up again, I will try and see what I get with my Xilinx setup. I know that with my 2-lane PCIe 2.0 (5 GT/s) setup, for writes to endpoint DRAM I was getting 350 MB/s measured over the full time my driver code took to copy the user buffer into TX1 DRAM, set up the DMA descriptor table in TX1 DRAM, have the DMA in the endpoint read the descriptor table from the TX1, do the DMA reads of the TX1 DRAM, and put the data into endpoint DRAM over PCIe (roughly the flow sketched below).
Here the main hardware limit is the 64-bit, 125 MHz (1 GB/s) bus I have between the DMA and the DRAM in the endpoint.
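For reference, here is a very stripped-down sketch of that host-side flow. The descriptor layout and the commented-out ep_write_reg() register pokes are hypothetical stand-ins for whatever the endpoint DMA IP actually uses; only the kernel DMA-mapping calls are standard API:

```c
/* Stripped-down sketch of the driver flow: the copy of the user buffer
 * is already in 'host_buf'; build a one-entry descriptor table in TX1
 * DRAM, then point the endpoint DMA at it. */
#include <linux/pci.h>
#include <linux/dma-mapping.h>

struct xfer_desc {                 /* hypothetical descriptor layout */
    u64 host_addr;                 /* TX1 DRAM bus address to read   */
    u32 len;                       /* bytes in this chunk            */
    u32 flags;                     /* e.g. last-descriptor marker    */
};

static int start_read_dma(struct pci_dev *pdev, void *host_buf, size_t len)
{
    struct xfer_desc *table;
    dma_addr_t table_bus, buf_bus;

    /* 1. Map the data buffer so the endpoint can read it from TX1 DRAM. */
    buf_bus = dma_map_single(&pdev->dev, host_buf, len, DMA_TO_DEVICE);
    if (dma_mapping_error(&pdev->dev, buf_bus))
        return -ENOMEM;

    /* 2. Descriptor table in TX1 DRAM, visible to the endpoint DMA. */
    table = dma_alloc_coherent(&pdev->dev, sizeof(*table), &table_bus, GFP_KERNEL);
    if (!table)
        return -ENOMEM;
    table[0].host_addr = buf_bus;
    table[0].len       = len;
    table[0].flags     = 1;        /* "last descriptor" (assumed encoding) */

    /* 3. Tell the endpoint DMA where the table is and start it; it then
     *    fetches the table, reads the TX1 DRAM, and writes the data into
     *    endpoint DRAM over PCIe. */
    /* ep_write_reg(pdev, DESC_TABLE_ADDR, table_bus); */
    /* ep_write_reg(pdev, DMA_START, 1);               */

    return 0;
}
```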
I see Altera has nicely provided 32/64-bit Linux drivers for their PCIe/DMA IP.