L4T 35.5.0: Crash in UEFI when adding a PCIe device

Hey,

We’re building a custom carrier board and are running into issues bringing up a PCIe device. With the PCIe device’s eval board connected to a Jetson Orin Nano devkit, the device works fine. Now we’re moving from the two EVK boards to a custom carrier board, and we can’t get the PCIe link up and running.

We’ve followed the NVIDIA documentation for the PCIe setup.

The schematic of the custom board is mostly identical to that of the two connected eval kits. The only change is that we’ve added manual control of the PCIe device reset, whereas the eval kit releases the device from reset as soon as it is powered. If we take the PCIe device out of reset during power-up / MB2, our UEFI crashes with the exception below (UEFI built in debug mode).

Are there any additional debug instructions available for PCIe bring-up?

PCIe Controller-1 Link is DOWN
add-symbol-file /home/user/develop/kirkstone/yocto-build/build/machine/tmp/work/machine-cosy-lin0
Loading driver at 0x00239544000 EntryPoint=0x0023954B558 EqosDeviceDxe.efi

add-symbol-file /home/user/develop/kirkstone/yocto-build/build/machine/tmp/work/machine-cosy-lin0
Loading driver at 0x00239531000 EntryPoint=0x0023953B534 PciHostBridgeDxe.efi

Unhandled Exception in EL3.
x30            = 0x0000000050000c4c
x0             = 0x0000000000000000
x1             = 0x00000000be000011
x2             = 0x0000000000000000
x3             = 0x0000000000000011
x4             = 0x0000000000200000
x5             = 0x000000026e9fe438
x6             = 0x0000000002190000
x7             = 0x0000000002190000
x8             = 0x0000000000000003
x9             = 0x0000000041223020
x10            = 0x000000000003073d
x11            = 0x000c010200000000
x12            = 0x000000000a0341d0
x13            = 0x0101000000060101
x14            = 0x000c010200000000
x15            = 0x000000000a0341d0
x16            = 0x000000023a0acdbc
x17            = 0x00000000000000e5
x18            = 0x000000023a0b92f0
x19            = 0x0000000000190000
x20            = 0x00000002361ad020
x21            = 0x0000000000000002
x22            = 0x00000002395eb1b0
x23            = 0x0000000000000001
x24            = 0x0000000000000001
x25            = 0x000000023953fead
x26            = 0x0000000000000000
x27            = 0x000000026e9fe5f8
x28            = 0x0000000000000004
x29            = 0x000000026e9fe490
scr_el3        = 0x000000000003073d
sctlr_el3      = 0x0000000030cd183f
cptr_el3       = 0x0000000000000000
tcr_el3        = 0x0000000080823518
daif           = 0x00000000000002c0
mair_el3       = 0x00000000004404ff
spsr_el3       = 0x00000000600003c9
elr_el3        = 0x000000023a0b2280
ttbr0_el3      = 0x0000000050026341
esr_el3        = 0x00000000be000011
far_el3        = 0x0000000000000000
spsr_el1       = 0x0000000000000000
elr_el1        = 0x0000000000000000
spsr_abt       = 0x0000000000000000
spsr_und       = 0x0000000000000000
spsr_irq       = 0x0000000000000000
spsr_fiq       = 0x0000000000000000
sctlr_el1      = 0x0000000030d00800
actlr_el1      = 0x0000000000000000
cpacr_el1      = 0x0000000000300000
csselr_el1     = 0x0000000000000000
sp_el1         = 0x0000000000000000
esr_el1        = 0x0000000000000000
ttbr0_el1      = 0x0000000000000000
ttbr1_el1      = 0x0000000000000000
mair_el1       = 0x0000000000000000
amair_el1      = 0x0000000000000000
tcr_el1        = 0x0000000000000000
tpidr_el1      = 0x0000000000000000
tpidr_el0      = 0x0000000080000000
tpidrro_el0    = 0x0000000000000000
par_el1        = 0x0000000000000800
mpidr_el1      = 0x0000000081000000
afsr0_el1      = 0x0000000000000000
afsr1_el1      = 0x0000000000000000
contextidr_el1 = 0x0000000000000000
vbar_el1       = 0x0000000000000000
cntp_ctl_el0   = 0x0000000000000005
cntp_cval_el0  = 0x000000001d9a3a4f
cntv_ctl_el0   = 0x0000000000000000
cntv_cval_el0  = 0x0000000000000000
cntkctl_el1    = 0x0000000000000000
sp_el0         = 0x000000023a0b92f0
isr_el1        = 0x0000000000000040
cpuectlr_el1   = 0xa000000b40543000
gicd_ispendr regs (Offsets 0x200 - 0x278)
 Offset:                        value
0000000000000200:               0x0000000000000000
0000000000000204:               0x0000000000000000
0000000000000208:               0x0000000000000000
000000000000020c:               0x0000000000000000
0000000000000210:               0x0000000000000000
0000000000000214:               0x0000000000000000
0000000000000218:               0x0000000000010000
000000000000021c:               0x0000000000020000
0000000000000220:               0x0000000000000000
0000000000000224:               0x0000000000000000
0000000000000228:               0x0000000000000000
000000000000022c:               0x0000000000000000
0000000000000230:               0x0000000000000000
0000000000000234:               0x0000000000000000
0000000000000238:               0x0000000000000000
000000000000023c:               0x0000000000000000
0000000000000240:               0x0000000000000000
0000000000000244:               0x0000000000000000
0000000000000248:               0x0000000000000000
000000000000024c:               0x0000000000000000
0000000000000250:               0x0000000000000000
0000000000000254:               0x0000000000000000
0000000000000258:               0x0000000000000000
000000000000025c:               0x0000000000000000
0000000000000260:               0x0000000000000000
0000000000000264:               0x0000000000000000
0000000000000268:               0x0000000000000000
000000000000026c:               0x0000000000000000
0000000000000270:               0x0000000000000000
0000000000000274:               0x0000000000000000
0000000000000278:               0x0000000000000000
000000000000027c:               0x0000000000000000
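
For reference, the gicd_ispendr part of the dump can be turned into interrupt IDs. The sketch below assumes the standard GIC layout, where the register at offset 0x200 + 4*n holds the pending bits for interrupt IDs 32*n to 32*n + 31; the function name is hypothetical, and only the two non-zero entries from the dump are fed in. Which peripherals those IDs belong to on this SoC is not something the dump tells us.

```python
# Sketch: list pending interrupt IDs from a GICD_ISPENDR dump.
# Assumes the standard GIC layout: register n sits at offset 0x200 + 4*n
# and bit b of register n corresponds to interrupt ID 32*n + b.

def pending_intids(ispendr, base=0x200):
    """Return the sorted interrupt IDs whose pending bit is set.

    ispendr maps register offset -> 32-bit register value.
    """
    ids = []
    for offset, value in sorted(ispendr.items()):
        n = (offset - base) // 4          # register index
        for bit in range(32):
            if (value >> bit) & 1:
                ids.append(32 * n + bit)
    return ids


# The only non-zero entries in the dump above.
dump = {0x218: 0x00010000, 0x21C: 0x00020000}
print(pending_intids(dump))  # [208, 241]
```

Interrupt IDs from 32 upwards are SPIs, so these two would be SPI 176 and SPI 209 if the device tree numbers SPIs from zero.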

If this cannot be reproduced on the NVIDIA devkit, I can only suggest that you review the hardware.

Most users say their board design is the same as the devkit, but in most cases it turns out that it is not.

Hey,

I can’t disagree here; I can only rely on others for the details of the electrical design.

We’ve gone over the design with our electrical guys, placing the EVK board designs and our design side by side. Our setup differs slightly in how the PCIe slave reset is handled: the combination of the two EVK boards deasserts the PCIe slave reset at power-on, while on our hardware we control the reset manually and deassert it during MB1. We’ve changed our design to take the PCIe slave out of reset when the Jetson powers on, but that did not resolve the issue. We’ve also added an I2C link to the PCIe slave device and can verify that it is out of reset and responding well before UEFI tries to talk PCIe to it.

To allow for some debugging, is there a way to get some details about the error state of the PCIe setup from the register dumps? Or a way to enable additional debug messages aside from building with EDK2_BUILD_RELEASE = "0"?
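
For what it’s worth, a couple of the dumped values can be decoded by hand. The sketch below is a minimal Python decoder, assuming the standard Armv8-A field layouts for ESR_ELx and SPSR_ELx. The function names and the small exception-class table are made up for illustration, and the table covers only a handful of classes.

```python
# Sketch: decode esr_el3 and spsr_el3 from the EL3 exception dump.
# Field positions follow the Armv8-A layouts; helper names are hypothetical.

EXCEPTION_CLASSES = {
    0x00: "Unknown reason",
    0x20: "Instruction Abort from a lower Exception level",
    0x21: "Instruction Abort taken without a change in Exception level",
    0x24: "Data Abort from a lower Exception level",
    0x25: "Data Abort taken without a change in Exception level",
    0x2F: "SError interrupt",
}


def decode_esr(esr):
    """Split an ESR_ELx value into EC, IL and ISS."""
    ec = (esr >> 26) & 0x3F               # bits [31:26], exception class
    return {
        "ec": ec,
        "ec_name": EXCEPTION_CLASSES.get(ec, "not in this small table"),
        "il": (esr >> 25) & 0x1,          # bit 25, instruction length
        "iss": esr & 0x1FFFFFF,           # bits [24:0], syndrome
    }


def decode_serror_iss(iss):
    """Split the ISS of an SError (EC 0x2F) into its fields."""
    return {
        "ids": (iss >> 24) & 0x1,         # 1 = implementation-defined syndrome
        "aet": (iss >> 10) & 0x7,         # error type, meaningful with RAS only
        "ea": (iss >> 9) & 0x1,           # external abort type
        "dfsc": iss & 0x3F,               # 0x11 = asynchronous SError interrupt
    }


def decode_spsr_mode(spsr):
    """Return the AArch64 mode the exception was taken from, e.g. 'EL2h'."""
    m = spsr & 0xF                        # M[3:0]; M[4] == 0 means AArch64
    level = m >> 2
    stack = "h" if m & 1 else "t"         # h = SP_ELx, t = SP_EL0
    return f"EL{level}{stack}"


esr = decode_esr(0xBE000011)              # esr_el3 from the dump
print(esr["ec_name"])                     # SError interrupt
print(hex(decode_serror_iss(esr["iss"])["dfsc"]))  # 0x11
print(decode_spsr_mode(0x600003C9))       # EL2h (spsr_el3 from the dump)
```

So the dump shows an asynchronous SError taken to EL3 from EL2. That would be consistent with a bus access that never completed, for example a read from a PCIe endpoint that does not respond, but the dump alone does not prove it. Because the error is asynchronous, far_el3 and elr_el3 do not necessarily point at the offending access.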

If this device is not needed in UEFI, you could also disable that PCIe controller in UEFI.

There are no additional debug messages to enable, since you are already using the debug build.

I have to wave the white flag here:

We are one of those “most cases”, and yes, our design was slightly different electrically.

When we change the EVK in the same way, we can reproduce the failure on the EVK boards.

We’ve now tried disabling the PCIe controller in UEFI, but that only moves the crash from UEFI to the kernel. We’re currently reviewing how to proceed with our current hardware, or whether we need a hardware respin.