External PCIe device disconnected at driver load

I am connecting an external PCIe device to a Jetson AGX Thor, but it gets disconnected at driver load.

In dmsg.txt I see the following AER report for the issue:

[ 15.280087] pcieport 0001:00:00.0: AER: Correctable error message received from 0001:00:00.0

[ 15.280104] pcieport 0001:00:00.0: DPC: containment event, status:0x3f01 source:0x0000

[ 15.280110] pcieport 0001:00:00.0: DPC: unmasked uncorrectable error detected

[ 15.280127] pcieport 0001:00:00.0: AER: found no error details for 0001:00:00.0

[ 15.280154] pcieport 0001:00:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)

[ 15.280157] pcieport 0001:00:00.0: device [10de:22d8] error status/mask=00040000/04400000

[ 15.280160] pcieport 0001:00:00.0: [18] MalfTLP (First)

[ 15.280163] pcieport 0001:00:00.0: AER: TLP Header: 4a008001 01000004 00000028 72040000

[ 15.280274] pci 0001:01:00.0: AER: can't recover (no error_detected callback)
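For reference, the `error status/mask=00040000/04400000` pair in the log above can be decoded bit by bit. The following is a minimal sketch, assuming the two values are the AER Uncorrectable Error Status and Mask registers with the bit layout from the PCIe Base Specification; the table and the helper name `decode_uncor` are illustrative and not part of any kernel tool.

```python
# Bit positions in the AER Uncorrectable Error Status/Mask registers
# (layout assumed from the PCIe Base Specification).
UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
    23: "MC Blocked TLP",
    24: "AtomicOp Egress Blocked",
    25: "TLP Prefix Blocked",
    26: "Poisoned TLP Egress Blocked",
}


def decode_uncor(value):
    """Return the names of all bits set in a status or mask value."""
    return [name for bit, name in sorted(UNCOR_BITS.items()) if value & (1 << bit)]


status = 0x00040000  # from "error status/mask=00040000/04400000"
mask = 0x04400000

print("status:", decode_uncor(status))
print("mask:  ", decode_uncor(mask))
```

Under these assumptions the only status bit set is bit 18 (Malformed TLP), which matches the `[18] MalfTLP (First)` line, and that bit is not among the masked ones.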

A PCIe protocol analyzer sees this transaction as error-free:

Note that the “Raw Symbol Display” shows the same data as the “TLP Header” of the bad packet in the AER message.

How can I debug why the Jetson AGX Thor sees this packet as malformed?
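To compare the analyzer view with the AER message field by field, the logged header can be split into its TLP fields. This is a rough sketch, assuming the `TLP Header:` line prints header DW0 to DW3 in order with header byte 0 in the most significant byte of each word, and that the packet is a completion (so DW1 and DW2 use the completion layout from the PCIe Base Specification). The function names are hypothetical.

```python
def parse_header_line(text):
    """Turn '4a008001 01000004 ...' into a list of 32-bit integers."""
    return [int(word, 16) for word in text.split()]


def format_bdf(rid):
    """Format a 16-bit Requester/Completer ID as bus:dev.fn."""
    return f"{(rid >> 8) & 0xFF:02x}:{(rid >> 3) & 0x1F:02x}.{rid & 0x7}"


def decode_completion(dws):
    """Decode DW0..DW2 of a completion TLP header (assumed layout)."""
    dw0, dw1, dw2 = dws[0], dws[1], dws[2]
    return {
        "fmt": (dw0 >> 29) & 0x7,            # 0b010 = 3 DW header with data
        "type": (dw0 >> 24) & 0x1F,          # 0b01010 = completion
        "tc": (dw0 >> 20) & 0x7,
        # Attr[2] sits in byte 1, Attr[1:0] in byte 2
        "attr": (((dw0 >> 18) & 0x1) << 2) | ((dw0 >> 12) & 0x3),
        "td": (dw0 >> 15) & 0x1,             # TLP digest (ECRC) present
        "ep": (dw0 >> 14) & 0x1,             # poisoned
        "length_dw": dw0 & 0x3FF,
        "completer_id": format_bdf((dw1 >> 16) & 0xFFFF),
        "status": (dw1 >> 13) & 0x7,
        "bcm": (dw1 >> 12) & 0x1,
        "byte_count": dw1 & 0xFFF,
        "requester_id": format_bdf((dw2 >> 16) & 0xFFFF),
        "tag": (dw2 >> 8) & 0xFF,
        "lower_address": dw2 & 0x7F,
    }


fields = decode_completion(parse_header_line("4a008001 01000004 00000028 72040000"))
for name, value in fields.items():
    print(f"{name:14} {value}")
```

With the header from the log, this layout gives a completion with data (Fmt 0b010, Type 0b01010), Length 1 DW, TD set, Attr 000, Completer ID 01:00.0, Byte Count 4, and Lower Address 0x28. Those are the values to check against the original read request in the analyzer trace.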

Is this a custom carrier board?
What PCIe device is in use here?

Hi Wayne,

We are using the standard Thor devkit we bought from C.R.G.

We connected our device through the M.2 slot. The device supports up to PCIe Gen 4, but the same problem is reported even at Gen 1.

The same device was also connected to an Orin devkit without issue, and it works properly on the PCIe side.

Could you share the full dmesg and lspci -vvv output from when the issue happened?

After further debugging, we might have found the root cause:

This is the failing scenario:

Our PCIe card returns 000 instead of 001 in the Attributes field. Is there a way to configure the platform to use “No Snoop bit” = 0?
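To make the 000 versus 001 comparison concrete, here is a small sketch that names the individual attribute bits. It assumes the three digits are Attr[2:0] as defined in the PCIe Base Specification (Attr[2] = ID-Based Ordering, Attr[1] = Relaxed Ordering, Attr[0] = No Snoop); the helper names are made up for illustration.

```python
def attr_flags(attr):
    """Split a 3-bit TLP Attr value into its named flags (assumed Attr[2:0] order)."""
    return {
        "id_based_ordering": bool(attr & 0b100),
        "relaxed_ordering": bool(attr & 0b010),
        "no_snoop": bool(attr & 0b001),
    }


def attr_mismatch(request_attr, completion_attr):
    """List the flags that differ between a request and its completion."""
    req = attr_flags(request_attr)
    cpl = attr_flags(completion_attr)
    return [name for name in req if req[name] != cpl[name]]


# Values from the post: the request carries 001, the card answers with 000.
print(attr_mismatch(0b001, 0b000))
```

Under that reading, the only difference between the request and the completion is the No Snoop bit.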

Hi,

We don’t think this issue has anything to do with the No Snoop bit.

A 1 DW (4-byte) memory read request is sent from Tegra to the EP.

The EP sends a completion of size 1 DW (4 bytes), but the byte count (outstanding bytes) is set to 4 bytes; it should be 0.

I believe this is the reason for the malformed TLP error. You need to check why the endpoint side is sending a byte count of 4 instead of 0.