PCIe Configuration

Hello, we have developed a custom board based on the T5000 and are doing development using the R38.4.0 system version. We are currently encountering some issues.

We use PCIe C2 for a B-key interface to connect a 5G module, and use C3 and C5 for M-key interfaces to connect NVMe SSDs. Theoretically, these three signal paths should not interfere with each other, but we are observing strange behavior:

  1. When C3 is enabled in the system, and only the C5 NVMe drive and C2 5G module are connected, the system cannot boot from C5; it keeps rebooting before even entering the system.
  2. When C3 is disabled in the system (with UPHY0 Lane6/7 set to UFS mode by default), and only the C5 NVMe drive and C2 5G module are connected, the system can boot normally from C5.
  3. When all three devices are connected, the system cannot boot from C5, but can boot normally from C3.
  4. When only the two NVMe drives are connected (without the 5G module), both C3 and C5 can boot normally.

It seems that connecting the 5G module affects the C5 interface, yet disabling C3 resolves the issue. The only conclusion we can draw is that mutual interference exists among the three devices. We have not been able to identify the root cause.
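
To make the pattern in the four situations explicit, the observations can be tabulated and checked against a candidate rule. The following is a minimal Python sketch that only restates what is reported above. The field names and the candidate rule are illustrative assumptions, not a diagnosis, and situation 4 is assumed to have C3 enabled because the system boots from it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    situation: int
    c3_enabled: bool  # UPHY0 Lane6/7 configured as PCIe (C3) rather than UFS
    modem: bool       # 5G module connected on C2
    c3_ssd: bool      # NVMe SSD connected on C3
    boot_from: str    # controller hosting the boot drive
    boots: bool       # reported outcome


# Transcribed from situations 1-4 (situations 3 and 4 each cover two boot targets).
OBSERVATIONS = [
    Observation(1, True, True, False, "C5", False),
    Observation(2, False, True, False, "C5", True),
    Observation(3, True, True, True, "C5", False),
    Observation(3, True, True, True, "C3", True),
    Observation(4, True, False, True, "C5", True),  # C3 enabled is an assumption
    Observation(4, True, False, True, "C3", True),
]


def candidate_rule(o: Observation) -> bool:
    """Hypothetical rule: a C5 boot fails only when C3 is enabled and the modem is present."""
    fails = o.boot_from == "C5" and o.c3_enabled and o.modem
    return not fails


def mismatches(observations):
    """Return the observations that the candidate rule does not predict."""
    return [o for o in observations if candidate_rule(o) != o.boots]


if __name__ == "__main__":
    bad = mismatches(OBSERVATIONS)
    print("rule consistent with all reports" if not bad else f"rule contradicted by: {bad}")
```

If the rule survives repeated trials, "C3 enabled and 5G module present" is the narrowest condition that matches every reported C5 boot failure, which may help decide what to vary next.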

Could you please look into this and provide any insights or suggestions?

Please at least share the logs captured when the error happened and the steps you followed to enable those PCIe controllers.

I referred to this link for how to enable C3:
https://forums.developer.nvidia.com/t/jetson-t5000-pcie-c3-rp/347577

C5 remains unchanged; it is enabled by default on the devkit and also provides the MKey interface.

The 5G driver has been verified to work properly on both the Thor Devkit and the AGX Orin Devkit (using a module adapter card to expose the BKey interface).

This is the log for Situation 1.
allconnect-C5-noboot.txt (77.3 KB)

By the way, we found an issue when using the Thor Devkit: the system boots stably without an Ethernet cable connected to the RJ45 port, but sometimes fails to boot when powered on with an Ethernet cable plugged in. This does not happen every time, which resembles our current problem. We think the two may be related because the J85 RJ45 port on the devkit uses PCIe C2, the same controller we use for the 5G module on our custom board.

What is the exact symptom of your issue?

I don’t get what you want us to check from this serial log.

The problem right now is that the three devices interfere with each other, but I cannot pinpoint the exact cause.
I haven’t been able to find any useful information in the serial log either, and I’m not sure what other logs you might need.
A similar issue occurred on the Devkit, so you can try to reproduce it there as described above.

Hi,

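Let me explain in Chinese instead. The problem right now is that I cannot see where any error occurs in your serial log.
Could you please point it out?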

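OK.

Yes, I cannot pinpoint the problem in the log either.

Our board design uses PCIe C2 for a B-key interface connected to the 5G module, while C3 and C5 each provide an M-key interface connected to an NVMe SSD.

The situation is roughly this: when UPHY0 Lane6 and Lane7 are configured as PCIe, with the 5G module and the C5 NVMe drive connected, there is a chance that the system fails to boot and reboots repeatedly. But when PCIe C3 is disabled, that is, when the lanes are left in the default UFS configuration, the system boots normally and the 5G module works properly.

The behavior above makes it look as if C3 is the cause, but if I connect all three devices, it is most likely the NVMe drive on C5 that fails to boot into the system.

The symptoms are confusing. It feels as if the three interfaces interfere with each other, and I cannot pinpoint exactly where the problem is. I am not sure whether this makes sense to you.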

Hi,

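No. You seem to have misunderstood what I am asking.

I understand the reproduction scenarios you described earlier. What I am saying is that the log you provided does not appear to contain a single error.
We need the log from the moment the problem occurs. For example, you say the system reboots repeatedly, but this log does not contain a single reboot.

If the log you provided does contain the error, please indicate at which point in time (which line) it occurs.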

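Understood.

The log I just provided is for Situation 1, which is a case where the reboot did occur. It does not show repeated reboots because I captured it over a serial connection to a PC: the serial device drops the instant the board reboots, so I can only keep reconnecting to capture output. The end of the log is therefore the point at which the reboot happened.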

[ 11.422614]
[ 11.725256] block nvme0n1: No UUID available providing old NGUID
[ 13.077733] iommu smmu3.0x0000008806000000: IOMMU driver was not able to establish FW requested direct mapping.
??INFO: END TASK:MB??
INFO: enter idle task.
INFO: END TASK:MB??
INFO: enter idle task.

Understood. So you are saying the machine rebooted right after the lines above were printed?

Yes. I have retried many times, and whenever the system fails to boot, the log almost always stops at this point.

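Is the problem you are hitting a reboot, or is the machine powering off?

In principle, the UART serial console should not disconnect just because the machine reboots. The console only drops when the machine loses power.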

On our board, the UART-to-USB converter is powered from VDD 3.3, which is cut during a system reset. We can wire out debug_uart separately and try again.

OK. You will probably need to sort that out and capture a complete log before we can confirm anything.

OK, we will try that first and then get back to you.

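Hello, I re-captured the log. This is Situation 3: after failing to boot into the system on the C5 NVMe drive, it hangs at the last part of the log.
allconnect-C5-noboot.txt (236.6 KB)
Please take a look, thanks!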

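I need to ask you for one more thing here.

Please use the tool Linux_for_Tegra/tools/demuxer/nv_tcu_demuxer to demultiplex your UART console output, then read the ccplex and bpmp UART logs.

Usage: Tegra Combined UART (NVIDIA Jetson Linux Developer Guide)

On AGX Thor, part of the log turns into garbled characters unless this tool is used. Since the crash happens right where the garbling occurs, we need the demuxed logs.

Also, is the log below what you mean by a reboot?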

[   10.032582] audit: type=1400 audit(1752259981.648:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="QtWebEngineProcess" pid=746 comm="apparmor_parser"
[   10.032770] audit: type=1400 audit(1752259981.648:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="buildah" pid=749 comm="apparmor_parser"
[   10.033296] audit: type=1400 audit(1752259981.648:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="Discord" pid=743 comm="apparmor_parser"
[   10.035508] audit: type=1400 audit(1752259981.652:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="ch-checkns" pid=756 comm="apparmor_parser"
[   10.035610] audit: type=1400 audit(1752259981.652:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="chrome" pid=758 comm="apparmor_parser"
[   10.080617] bluetooth hci0: Direct firmware load for qca/rampatch_usb_00190200.bin failed with error -2
[   10.080623] Bluetooth: hci0: failed to request rampatch file: qca/rampatch_usb_00190200.bin (-2)
[   10.295887] r8152 1-2.4:1.0: Direct firmware load for rtl_nic/rtl8153b-2.fw failed with error -2
[   10.304913] Loaded X.509 cert 'wens: 61c038651aabdcf94bd0ac7ff06c7248db18c600'tabase)
[   11.253112]
[   11.532722] block nvme0n1: No UUID available providing old NGUID
[   12.914036] iommu smmu3.0x0000008806000000: IOMMU driver was not able to establish FW requested direct mapping.
??INFO: END TASK:MB??
INFO: enter idle task.
INFO: END TASK:MB??
INFO: enter idle task.
??[   13.360528] [I][mhi_netdev_enable_iface] Prepare the channels for transfer
[   13.377562] [I][mhi_netdev_enable_iface] Exited.
???????
[0000.095] I> MB1 (version: 0.23.0.2-t264-75019003-378e427f)
[0000.095] C> Boot-mode : Coldboot
[0000.095] C> MB1 last_boot_error: 0x0
[0000.096] I> Entry timestamp: 0x000139ce
[0000.098] C> rst_source: 0x0, rst_level: 0x0
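
For long captures, a reboot boundary like the one in the excerpt above can be located mechanically: kernel lines carry a six-decimal timestamp, and the MB1 banner restarts the clock near zero. The following is a minimal Python sketch, assuming the capture is a plain text file and that the MB1 banner always has the form seen in this log. The function name `find_reboots` is illustrative.

```python
import re
import sys

# Kernel lines look like "[   12.914036] ..."; the MB1 banner looks like
# "[0000.095] I> MB1 (version: ...". Both patterns are taken from this log only.
KERNEL_TS = re.compile(r"\[\s*(\d+\.\d{6})\]")
MB1_BANNER = re.compile(r"\[\d{4}\.\d{3}\] I> MB1 \(version:")


def find_reboots(lines):
    """Return (banner_line_number, last_kernel_timestamp) for each MB1 banner
    that appears after kernel output, i.e. each restart seen in the capture."""
    reboots = []
    last_kernel_ts = None
    for number, line in enumerate(lines, start=1):
        if MB1_BANNER.search(line):
            if last_kernel_ts is not None:
                reboots.append((number, last_kernel_ts))
            last_kernel_ts = None
            continue
        # search() rather than match(): garbled bytes may precede the timestamp.
        match = KERNEL_TS.search(line)
        if match:
            last_kernel_ts = float(match.group(1))
    return reboots


if __name__ == "__main__" and len(sys.argv) > 1:
    # errors="replace" keeps going when the capture contains undecodable bytes.
    with open(sys.argv[1], errors="replace") as capture:
        for line_number, kernel_ts in find_reboots(capture):
            print(f"restart at line {line_number}, last kernel timestamp {kernel_ts:.6f}")
```

Run against a capture file, this prints one line per restart, which gives the "which line" reference requested earlier in the thread.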

OK, I will re-capture it.

Judging by the symptoms, yes.

Regarding the last line

[ 12.914036] iommu smmu3.0x0000008806000000: IOMMU driver was not able to establish FW requested direct mapping.

how much time passed between it and the first line after the reboot?

[0000.095] I> MB1 (version: 0.23.0.2-t264-75019003-378e427f)
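
The kernel and MB1 timestamps appear to come from clocks that restart at each boot, so the gap asked about here probably cannot be read from the log itself and would have to be measured on the host that captures the console. The following is a minimal Python sketch under that assumption. It stamps each incoming line with host time; the names `stamp_lines` and `gap_before` are illustrative, and the command that feeds the serial output into stdin is left to the reader.

```python
import sys
import time


def stamp_lines(lines, clock=time.monotonic):
    """Yield (seconds_since_first_line, text) using the host clock, not the target's."""
    start = None
    for line in lines:
        now = clock()
        if start is None:
            start = now
        yield now - start, line.rstrip("\r\n")


def gap_before(stamped, marker):
    """Host-side seconds between the first line containing `marker`
    and the line received just before it; None if no such pair exists."""
    previous = None
    for elapsed, text in stamped:
        if marker in text and previous is not None:
            return elapsed - previous
        previous = elapsed
    return None


if __name__ == "__main__" and sys.argv[1:] == ["-"]:
    # Usage: <serial reader command> | python3 stamp.py -
    for elapsed, text in stamp_lines(sys.stdin):
        print(f"{elapsed:10.3f} {text}", flush=True)
```

Calling `gap_before(stamped, "MB1 (version:")` on the stamped lines gives the host-measured interval between the last line before the restart and the MB1 banner, subject to buffering delays in the serial path.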