HOST <----------------> DEVICE
PCIe interface
TX2 NX <----------------> FPGA
JetPack 4.6.6
The FPGA stores one frame of image data. The TX2 NX continuously reads this image frame via the PCIe interface and verifies its data. If an error occurs, it stops and prints the current MD5 value.
We also ran the same test on Xavier NX and Orin NX, and neither of them showed this issue.
On the failing unit, which device is being examined when using the command “lspci”? You might add the lspci output to the forum.
Each lspci line also mentions a slot which the device occupies on the left side of the output. As a contrived example, that slot might look something like “01:00.0”. If we use that example, then we could list just that device with:
lspci -s 01:00.0
To get a fully verbose output for just that device, and to create a log file which can be attached to this forum:
sudo lspci -s 01:00.0 -vvv 2>&1 | tee log_lspci.txt
(then you could attach file “log_lspci.txt” to the forum)
Also important is to know if this is a developer’s kit. Some people use a Jetson module with a third party carrier board, and this changes the device tree. One of those changes might include drive strength and slew rate for a PCIe component. Or it might completely change which pins provide the signals. None of the above will help if we assume this is a dev kit and it is really something else.
I forgot to mention: Make the lspci log file after an error has occurred.
We are using NVIDIA’s official Jetson modules and carrier boards, and the FPGA is AMD’s official development board. The device tree has not been modified in any way. When the issue occurs, there is no output in dmesg.
lspci:
01:00.0 Serial controller: Xilinx Corporation Device 9022 (prog-if 01 [16450])
    Subsystem: Xilinx Corporation Device 0007
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
    Latency: 0
    Region 0: Memory at 44000000 (32-bit, non-prefetchable) [size=64M]
    Region 1: Memory at 42000000 (32-bit, non-prefetchable) [size=64K]
    Capabilities: [40] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [48] MSI: Enable- Count=1/1 Maskable- 64bit+
        Address: 0000000000000000 Data: 0000
    Capabilities: [60] MSI-X: Enable+ Count=32 Masked-
        Vector table: BAR=1 offset=00008000
        PBA: BAR=1 offset=00008fe0
    Capabilities: [70] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 1024 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 128 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
        LnkCap: Port #0, Speed 5GT/s, Width x2, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 5GT/s, Width x2, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range BC, TimeoutDis+, LTR-, OBFF Not Supported
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
        LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
            EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
    Capabilities: [100 v1] Advanced Error Reporting
        UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
    Capabilities: [1f0 v1] Virtual Channel
        Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
        Arb: Fixed- WRR32- WRR64- WRR128-
        Ctrl: ArbSelect=Fixed
        Status: InProgress-
        Port Arbitration Table [500] <?>
        VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
            Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
            Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
            Status: NegoPending- InProgress-
    Kernel driver in use: xdma
    Kernel modules: xdma
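On the PCIe side, no uncorrectable errors were logged:
In the AER capability, every UESta (uncorrectable error status) bit is clear and the First Error Pointer is 00. That pointer is a small index identifying which UESta bit was set first, so with nothing set it carries no error. DevSta does show CorrErr+ and UnsuppReq+, and CESta shows NonFatalErr+; these are sticky bits which are commonly set during enumeration, so by themselves they do not point to a failed transfer.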
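When comparing logs taken before and after a failure, it can help to list only the status bits that are set. The following is a minimal sketch, assuming the log is the text produced by the "lspci -vvv" command above; the helper name set_flags is made up for illustration and the sample lines are copied from the posted output.

```python
import re

def set_flags(lspci_text, register):
    """Return the names of the flags marked '+' on the line for `register`."""
    for line in lspci_text.splitlines():
        line = line.strip()
        if line.startswith(register + ":"):
            # Everything after the first colon is a list of NAME+ / NAME- flags.
            return re.findall(r"(\w+)\+", line.split(":", 1)[1])
    return []

# Lines copied from the lspci output posted in this thread.
sample = """\
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
"""

for reg in ("DevSta", "UESta", "CESta"):
    print(reg, set_flags(sample, reg))
```

Running the same helper on a log captured right after a CRC failure and diffing the result against a log from a clean boot would show whether any new bit appeared.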
Regarding signal quality, this device is capable of PCIe v2, and is running at its full speed for that:
LnkCap: Port #0, Speed 5GT/s, Width x2, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited
...
LnkSta: Speed 5GT/s, Width x2, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
...
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
(5GT/s is the transfer rate for PCIe generation 2)
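The same comparison of LnkCap against LnkSta can be automated for repeated captures. This is only a sketch under the assumption that the log uses the "Speed ..., Width ..." wording shown above; the function names are hypothetical.

```python
import re

def link_params(lspci_text, register):
    """Return (speed, width) from the 'LnkCap:' or 'LnkSta:' line, or None."""
    m = re.search(register + r":.*?Speed ([\d.]+GT/s), Width (x\d+)", lspci_text)
    return m.groups() if m else None

def link_at_full_capability(lspci_text):
    """True when the negotiated link (LnkSta) matches the capability (LnkCap)."""
    cap = link_params(lspci_text, "LnkCap")
    sta = link_params(lspci_text, "LnkSta")
    return cap is not None and cap == sta

# Lines copied from the lspci output posted in this thread.
sample = """\
LnkCap: Port #0, Speed 5GT/s, Width x2, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited
LnkSta: Speed 5GT/s, Width x2, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
"""

print(link_params(sample, "LnkCap"), link_params(sample, "LnkSta"))
print(link_at_full_capability(sample))
```

A link that retrained to a lower speed or narrower width after heating up would show a mismatch here, which would be a hint of a signal quality problem.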
Can you get an lspci tree view? There might be a bridge between the device and the controller:
lspci -vt
If there is a bridge which is a parent to the PCIe slot, then we’d need to check the verbose lspci of the bridge as well (after an error occurs), but bridges are rarely a problem. So far it looks like PCIe is not the source of the issue.
root@ubuntu:/home/ubuntu# lspci -vt
-[0000:00]---01.0-[01]----00.0 Xilinx Corporation Device 9022
We found that as the temperature increases, the error rate also increases. Is it possible that the errors are temperature-related? When the CPU temperature is 37.7°C, running a million read operations does not produce any errors, but when the CPU temperature rises to 54.4°C, errors become easier to reproduce.
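To correlate failures with temperature more precisely, the temperature could be logged next to each transfer result. Below is a minimal sketch that reads the standard Linux thermal zones from sysfs, which report millidegrees Celsius. The function name is hypothetical, and zone names differ between platforms, so none is assumed here.

```python
import glob
import os

def read_thermal_zones(base="/sys/class/thermal"):
    """Return {zone_type: temperature_in_celsius} for every readable thermal zone."""
    temps = {}
    for zone in sorted(glob.glob(os.path.join(base, "thermal_zone*"))):
        try:
            with open(os.path.join(zone, "type")) as f:
                name = f.read().strip()
            with open(os.path.join(zone, "temp")) as f:
                # sysfs reports the value in millidegrees Celsius
                temps[name] = int(f.read().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # skip zones that are unreadable or report no number
    return temps

if __name__ == "__main__":
    print(read_thermal_zones())
```

Calling this once per test iteration and writing the result beside the pass/fail outcome would give a temperature for every failure instead of two spot readings.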
Observations during the test experiment:
- When XDMA transfers data to memory (memory on the TX2 NX module), the first CRC32 check fails immediately. However, the second CRC32 check passes.
- In some cases, consecutive CRC32 checks consistently fail. After stopping the test, manually performing a CRC32 check on the memory data file shows the correct result.
- The issue is more easily reproduced when temperature rises. During testing, when the CPU temperature remained stable at 37.7°C, XDMA transferred data continuously over a million times without errors. However, when the temperature rose rapidly, the issue reappeared (CPU temperature at 54.4°C).
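The three outcomes described above (pass, fail then pass, fail twice) can be told apart with a small classifier. This is a sketch, not the code from the attached test programs: read_buffer is a hypothetical callable that returns the current contents of the receive buffer, and the labels are made up for illustration.

```python
import zlib

def classify_transfer(read_buffer, expected_crc):
    """Check the received buffer up to twice and classify the outcome.

    read_buffer: zero-argument callable returning the buffer's current bytes.
    """
    if zlib.crc32(read_buffer()) == expected_crc:
        return "pass"
    # First check failed: look at the same memory again without a new transfer.
    if zlib.crc32(read_buffer()) == expected_crc:
        return "late"      # data was correct by the time of the second check
    return "corrupt"       # data still wrong on the second check

# Simulated example: the first look sees zeros, the second sees the real frame.
good = b"frame" * 10
views = iter([b"\0" * len(good), good])
print(classify_transfer(lambda: next(views), zlib.crc32(good)))
```

If most failures fall in the "late" class, that would be consistent with the data reaching memory (or becoming visible to the CPU) after the transfer was reported complete, rather than with corruption on the link. That reading is an interpretation, not something the thread has confirmed.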
Additional tests conducted:
- Under the same test conditions, Orin NX and Xavier NX showed no errors even after over a million transfers.
- The issue persists in both JetPack SDK 4.6.1 and JetPack SDK 4.6.6.
- The same problem was reproduced on different hardware versions of TX2 NX (301 and 302).
Initial suspicion: There may be a performance issue with the cache or DDR. Currently, we are unsure of the exact root cause and how to further troubleshoot this issue.
Hi,
We would like to know at which point the MD5 check is done. Could you share the whole pipeline and tell us when the MD5 check is performed?
There is no PCI bridge involved (the lspci tree view says so), and the PCIe link itself shows no errors. Temperature is always something to consider. DMA could be related, but it won’t be the PCIe link itself. That leads back to the question @WayneWWW is now asking: where is the checksum performed? The answer would suggest which part of the hardware (and driver) is involved, but I personally do not have that information.
FPGA(XDMA) -------PCIE------- TX2 NX
APP pipeline:
posix_memalign a buffer ---> open xdma_c2h0 ---> read into this buffer ---> crc32 this buffer
kernel pipeline:
system call read ---> char_sgdma_read_write
---> check_transfer_align ---> xdma_xfer_submit
---> transfer_queue ---> xlx_wait_event_interruptible_timeout ---> TRANSFER_STATE_COMPLETED
Driver code: Xilinx/dma_ip_drivers on GitHub (refer to the XDMA section).
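For readers without the attachments, the APP pipeline above can be sketched as follows. This is an illustration only, under several assumptions: the device path is a placeholder passed in by the caller, the node is read with ordinary file semantics, and the page-aligned buffer obtained with posix_memalign in the real test is not reproduced.

```python
import os
import zlib

def read_frame_crc(path, frame_size):
    """Read `frame_size` bytes from `path` and return (data, crc32_of_data)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        buf = bytearray()
        while len(buf) < frame_size:
            chunk = os.read(fd, frame_size - len(buf))
            if not chunk:
                break  # end of data: fewer bytes available than requested
            buf += chunk
    finally:
        os.close(fd)
    return bytes(buf), zlib.crc32(buf)

# Usage (placeholder path and size; substitute the real C2H node and frame size):
# data, crc = read_frame_crc("/dev/<xdma c2h node>", frame_size)
```

The CRC here is computed by the CPU over the buffer right after read() returns, which is the point where the kernel pipeline reports TRANSFER_STATE_COMPLETED.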
Hi,
I’ll send you the test source code; it’s very simple.
xdma_app_test.zip (1.7 KB)
xdma_app_test_two_pattern.zip (1.2 KB)
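This is my test program that fetches data from two addresses on the FPGA side and stores it in a buffer on the TX2 NX. The check is done after the data has been read out.
Let me clarify this with you directly in Chinese.
So the part you described here: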
- When XDMA transfers data to memory (memory on the TX2 NX module), the first CRC32 check fails immediately. However, the second CRC32 check passes.
- In some cases, consecutive CRC32 checks consistently fail. After stopping the test, manually performing a CRC32 check on the memory data file shows the correct result.
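Where is that in the two source packages you provided…?
Also, is xdma_app_test.zip the Jetson-side operation, and xdma_app_test_two_pattern.zip your FPGA side?
Both apps run on the TX2 NX, and the two CRC32 checks are performed back to back. During testing we found that after a CRC32 check failed and the program stopped, reading the data from memory and checking it again gave the correct result. So, to verify whether the problem is caused by the DDR data being updated too slowly, a second CRC is run whenever the first CRC32 fails.
The result is that in most cases the first CRC passes; in a small number of cases the first CRC fails and the second passes; and when both CRCs fail, the test ends.
Let me confirm: so you are saying that xdma_app_test.zip is the test you did at the beginning, and later, to check whether the data was correct, you made xdma_app_test_two_pattern.zip?
Let me also confirm my understanding of "xdma_app_test_two_pattern": it looks like you changed it to read two different raw files for this test.
What I want to ask is, what does the error actually look like when this runs? It looks to me like you are doing the CRC check over and over?
Your understanding is correct. There are two different RAW images with the same data size. They are read repeatedly and stored in the same buffer on the TX2 NX. The first RAW frame is read and its CRC is checked; on failure the program prints the checksum and exits; on success it reads the second RAW frame and checks its CRC; if the second frame fails the check, the program prints the checksum and exits. This flow runs in a loop.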
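The loop described above can be sketched as follows. It is not the code from xdma_app_test_two_pattern.zip: read_frame is a hypothetical callable standing in for the XDMA read of frame 0 or frame 1 into the shared buffer, and the return value is chosen for illustration.

```python
import zlib

def run_two_pattern_test(read_frame, expected_crcs, iterations):
    """Alternate between two frames and stop at the first CRC mismatch.

    read_frame(index) -> bytes of frame `index` (0 or 1).
    Returns None if every check passes, else (iteration, frame_index, actual_crc).
    """
    for i in range(iterations):
        for idx in (0, 1):
            crc = zlib.crc32(read_frame(idx))
            if crc != expected_crcs[idx]:
                return (i, idx, crc)  # the real test prints the checksum and exits
    return None

# Simulated example with two same-sized frames that are always read correctly.
frames = [b"A" * 64, b"B" * 64]
expected = [zlib.crc32(f) for f in frames]
print(run_two_pattern_test(lambda idx: frames[idx], expected, 1000))
```

Because the two frames differ, a failure whose checksum equals the other frame's expected CRC would suggest the buffer still held the previous frame, while any other value would suggest partially updated or corrupted data.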