Jetson TX2 + Xilinx PCIe

Good Evening,

I plan to send data from a Xilinx FPGA to the Jetson TX2 over PCIe x4 (a Xilinx eval board connected to the NVIDIA TX2 carrier board as a benchtop prototype). I found a thread where someone else accomplished this, though with some difficulty. Please refer to this link:

https://devtalk.nvidia.com/default/topic/988351/jetson-tx1/what-is-the-current-status-of-pcie-dma-/2

A few posts in, @vidyas said he modified the driver to get it working (https://devtalk.nvidia.com/default/topic/988351/jetson-tx1/what-is-the-current-status-of-pcie-dma-/post/5077803/#5077803) (@kayccc: apologies for the PM, I just realized you were not the moderator who fixed the code!)

I was wondering if you (the moderators or @vidyas) have this fixed driver so that I do not need to go through the same troubles as @mrbmcg.

R/
Nick

Hi,
In that thread, the developer had an issue with their own driver, and we fixed that issue.
That driver is not something NVIDIA provides, owns, or maintains. We just helped them get their setup working.
In your case, you would be writing your own driver, and we can help on an as-needed basis.

Thank you @vidyas. Can you help me get started and point me to a reference design that uses the TX2 PCIe driver? This is really what I was hoping for out of the conversation. Thank you.

If I understand your requirement correctly, you need to write an endpoint device driver (client driver) for your endpoint. The kernel already contains PCIe client drivers for many kinds of device functionality.
For a generic introduction to writing a PCIe endpoint driver, Documentation/PCI/pci.txt in the kernel source is a good starting point.

@vidyas, thank you for the information – I will take a look in the documentation that you referenced.

To clarify, I will be connecting a Xilinx FPGA as the completer endpoint device with the TX2 as the requester (root complex), similar to this user’s post https://devtalk.nvidia.com/default/topic/977168/fpga-pcie-device-not-initialized-by-tx1-root-complex/?offset=4.
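Before any driver work, it can help to confirm that the root complex has actually enumerated the FPGA. Below is a minimal user-space sketch that reads a device's vendor ID from sysfs and compares it with 0x10ee, the Xilinx PCI vendor ID. The helper names are hypothetical, the bus address passed in (something like "0000:01:00.0") depends on the board, and the sketch assumes the FPGA design keeps the default Xilinx vendor ID.

```c
#include <stdio.h>
#include <stdlib.h>

#define XILINX_VENDOR_ID 0x10eeUL

/* Parse a sysfs ID string such as "0x10ee\n"; returns 0 if nothing parses. */
unsigned long parse_pci_id(const char *text)
{
    char *end;
    unsigned long id = strtoul(text, &end, 16); /* base 16 accepts a "0x" prefix */
    return (end == text) ? 0 : id;
}

/* Read the vendor ID of one device; bdf is a bus address such as
 * "0000:01:00.0" (example value only). Returns 0 if it cannot be read. */
unsigned long read_vendor_id(const char *bdf)
{
    char path[128];
    char buf[32];
    FILE *f;

    snprintf(path, sizeof path, "/sys/bus/pci/devices/%s/vendor", bdf);
    f = fopen(path, "r");
    if (f == NULL)
        return 0;
    if (fgets(buf, sizeof buf, f) == NULL)
        buf[0] = '\0';
    fclose(f);
    return parse_pci_id(buf);
}

/* Nonzero if the device at bdf reports the Xilinx vendor ID. */
int is_xilinx_endpoint(const char *bdf)
{
    return read_vendor_id(bdf) == XILINX_VENDOR_ID;
}
```

If is_xilinx_endpoint() returns 0 for every address listed under /sys/bus/pci/devices, the link probably did not train or the device was not enumerated, and that needs solving before a driver can bind.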

I see a handful of users have worked on this task, so I was hoping that there would be a generic reference design/template available to get started so that I wouldn’t have to re-invent the wheel. That way I could start on my custom firmware/user application without the overhead of developing the standard protocols (init, polling/irq handling, etc.).

Well, these are endpoint device drivers, and NVIDIA doesn’t own or maintain any.

[s]Hello, and thank you in advance for any help.

I know this post has been answered already and I can move it to another thread if necessary.

I am a colleague of @chirstnp_work who has been working on this same problem with him for several months now. I have managed to transfer the amount of data our application requires without data errors using a TX1 and Xilinx’s loopback FPGA example. My steps to achieve this are as follows (I’ll try to be as clear as possible; I apologize for any redundancy):

  1. Install Jetpack 2.3.1 (L4T 24.2.1)
  2. Unpack the L4T 24.2.1 kernel to the TX1 and compile/install it using the jetsonhacks build script (https://github.com/jetsonhacks/buildJetsonTX1Kernel/tree/v1.0-L4T24.2.1)
  3. Reboot
  4. Download the Xilinx XDMA driver sources (https://www.xilinx.com/Attachment/Xilinx_Answer_65444_Linux_Files.zip)
  5. Unpack the Xilinx XDMA sources to /home/nvidia
  6. Change RX_BUF_PAGES in the Xilinx driver's include/xdma-core.h from 256 to 2048
  7. Download the version of the Xilinx xdma-core.c file with the cyclic buffer disabled (file: https://forums.xilinx.com/xlnx/attachments/xlnx/PCIe/9115/1/xdma-core_cyclic_buffer_disabled.c; forum thread: https://forums.xilinx.com/t5/PCI-Express/PCIE-DMA-subsystem-AXI4-Streaming-c2h-transfers/td-p/791701)
  8. Apply the attached patch
  9. Build the Xilinx XDMA sources and run load_driver.sh with the FPGA plugged into PCIe and programmed with the loopback design

At this point, multiple transfers of size 8M will complete without data errors, but dmesg will still show mc-errs and smmu faults.
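A quick arithmetic check on the RX_BUF_PAGES change, as a sketch under two assumptions that are not stated in the post: that pages are 4 KiB, and that the receive buffer covers RX_BUF_PAGES whole pages. Under those assumptions, 256 pages is 1 MiB and 2048 pages is 8 MiB, which lines up with the 8M transfer size if 8M is read as 8 MiB. The helper name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 4 KiB pages; verify with `getconf PAGESIZE` on the target. */
#define ASSUMED_PAGE_SIZE 4096u

/* Bytes spanned by a buffer of `pages` pages (hypothetical helper). */
size_t rx_buf_bytes(size_t pages, size_t page_size)
{
    assert(page_size > 0);
    return pages * page_size;
}
```

With these numbers, rx_buf_bytes(256, ASSUMED_PAGE_SIZE) is 1048576 and rx_buf_bytes(2048, ASSUMED_PAGE_SIZE) is 8388608.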

Before the patch is applied, the modified xdma-core.c file (cyclic buffer disabled) completes small numbers of 8M transfers successfully, but with more transfers (roughly 64 or more) it causes a kernel crash because of a BUG_ON macro that verifies the “transfer” pointers are not null. The exact transfer number at which the crash occurs is unpredictable, but it always happens after the WARN_ON macro (line 1380) executes. My modifications were designed to prevent the driver from calling the functions that triggered the BUG_ON macros if their arguments were null.
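To illustrate the kind of guard described above, here is a user-space sketch, not the actual patch. The struct and function names are hypothetical stand-ins for the driver's transfer handling. The idea is only that a null pointer is reported to the caller and skipped, where a BUG_ON on the same condition would halt the kernel.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver's transfer descriptor. */
struct transfer {
    size_t len;
    int completed;
};

/*
 * Mark a transfer complete. A BUG_ON(!t) at this point would crash the
 * kernel on a null pointer; this guard returns an error instead and
 * leaves the decision about recovery to the caller.
 */
int complete_transfer(struct transfer *t)
{
    if (t == NULL)
        return -EINVAL;   /* skip the call rather than crash */
    t->completed = 1;
    return 0;
}
```

A guard like this hides the symptom only. Whatever left the pointer null is still present, which fits the description of the change as a crude workaround.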

Let me qualify my changes to xdma-core.c by saying that I don’t believe they are a good solution, simply a very crude workaround to show proof-of-concept. In fact, I’m surprised I haven’t noticed more serious problems yet.

My question is: the changes I made to the modified xdma-core.c file prevented the driver from triggering the BUG_ON macros unpredictably, but I’m not sure why the modified driver works in the first place while the original does not. My guess is that it is some kind of race condition where the size of the transfer list has been updated, but the data structure to which the pointer is meant to refer has not been allocated yet.
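If the race-condition guess is right, the usual remedy is to store the pointer first and only then publish the new list size, so a reader that sees the larger size also sees a valid pointer. The sketch below shows that ordering in user-space C11 atomics with a single producer. All names are hypothetical and nothing here is taken from xdma-core.c; in kernel code the equivalent primitives would be smp_store_release() and smp_load_acquire().

```c
#include <stdatomic.h>
#include <stddef.h>

#define MAX_TRANSFERS 64

struct transfer { size_t len; };

struct transfer_list {
    struct transfer *slots[MAX_TRANSFERS];
    atomic_size_t count;               /* number of published slots */
};

/* Producer (single writer assumed): fill the slot, then publish it. */
int list_push(struct transfer_list *l, struct transfer *t)
{
    size_t n = atomic_load_explicit(&l->count, memory_order_relaxed);

    if (t == NULL || n >= MAX_TRANSFERS)
        return -1;
    l->slots[n] = t;                   /* pointer is written first */
    atomic_store_explicit(&l->count, n + 1, memory_order_release);
    return 0;
}

/* Consumer: every slot below the acquired count holds a valid pointer. */
struct transfer *list_peek_last(struct transfer_list *l)
{
    size_t n = atomic_load_explicit(&l->count, memory_order_acquire);

    return n ? l->slots[n - 1] : NULL;
}
```

Reversing the two stores in list_push() would reproduce the suspected failure: a reader could observe the new count while the slot still holds a stale or null pointer.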

I am hoping there is some relatively obvious reason why one driver version works while the other does not based on the TX1 architecture and the L4T implementation.[/s]
xdma-core_cyclic_buffer_disabled.c (154 KB)
xdma_core_modifications.patch.txt (2.9 KB)
xdma-core-MODDED.c (155 KB)

Apologies for posting to an answered thread. I have moved my post to the TX1 forum (https://devtalk.nvidia.com/default/topic/1030940/jetson-tx1/xilinx-fpga-pcie-driver-working-on-tx1/)