PCIe errors on AGX Orin with Samsung 990 2TB

I added a Samsung 990 to my new Orin and it throws PCIe errors fairly consistently. Is there a known compatibility issue, or is something else going on? I just reflashed it to JetPack 6, and it's new out of the box.

Kernel: 5.15.148-tegra

pcieport 0004:00:00.0: AER: Corrected error received: 0004:00:00.0
[12096.994854] pcieport 0004:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[12096.994856] pcieport 0004:00:00.0: device [10de:229c] error status/mask=00000001/0000e000
[12096.994859] pcieport 0004:00:00.0: [ 0] RxErr (First)
[12102.097114] pcieport 0004:00:00.0: AER: Corrected error received: 0004:00:00.0
[12102.097132] pcieport 0004:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[12102.097135] pcieport 0004:00:00.0: device [10de:229c] error status/mask=00000001/0000e000
[12102.097137] pcieport 0004:00:00.0: [ 0] RxErr (First)
[12233.398851] pcieport 0004:00:00.0: AER: Corrected error received: 0004:00:00.0
[12233.398871] pcieport 0004:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[12233.398874] pcieport 0004:00:00.0: device [10de:229c] error status/mask=00000001/0000e000
[12233.398876] pcieport 0004:00:00.0: [ 0] RxErr (First)
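
For readers unfamiliar with the `error status/mask=00000001/0000e000` field, here is a minimal Python sketch that decodes both values using the bit layout of the PCIe AER Correctable Error Status register. The helper name `decode` is made up for illustration and is not part of any kernel tool.

```python
# Bit positions in the PCIe AER Correctable Error Status/Mask registers.
CORRECTABLE_BITS = {
    0: "Receiver Error (RxErr)",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def decode(value_hex):
    """Return the names of all correctable-error bits set in a hex string."""
    value = int(value_hex, 16)
    return [name for bit, name in sorted(CORRECTABLE_BITS.items())
            if value & (1 << bit)]


# Values copied from the log lines above.
status, mask = "00000001/0000e000".split("/")
print("reported:", decode(status))
print("masked:  ", decode(mask))
```

In these logs only bit 0 (receiver error) is set in the status value, which matches the `RxErr` line. The bits set in the mask are error types whose reporting is suppressed.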

Same here.

Steps to reproduce: install a 2TB Samsung 990 SSD in an Orin, then run:

git clone https://github.com/microsoft/onnxruntime
cd onnxruntime
./build.sh

In another terminal, check for PCIe errors:
sudo tail -f /var/log/syslog
or dmesg

Nov 4 14:25:29 ubuntu kernel: [ 3937.502805] pcieport 0004:00:00.0: AER: Corrected error received: 0004:00:00.0
Nov 4 14:25:29 ubuntu kernel: [ 3937.502820] pcieport 0004:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Nov 4 14:25:29 ubuntu kernel: [ 3937.502823] pcieport 0004:00:00.0: device [10de:229c] error status/mask=00000001/0000e000
Nov 4 14:25:29 ubuntu kernel: [ 3937.502825] pcieport 0004:00:00.0: [ 0] RxErr (First)
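
To put a number on "how often", here is a rough Python sketch that pulls the kernel timestamps of the `AER: Corrected error received` lines out of dmesg or syslog text and prints the gaps between them. The function name `corrected_error_times` and the regular expression are assumptions written for this thread's log format only.

```python
import re

# Matches the "AER: Corrected error received" line in dmesg or syslog output.
AER_LINE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+pcieport\s+"
    r"([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]): "
    r"AER: Corrected error received"
)


def corrected_error_times(lines):
    """Map each reporting port to the kernel timestamps of its corrected errors."""
    times = {}
    for line in lines:
        match = AER_LINE.search(line)
        if match:
            times.setdefault(match.group(2), []).append(float(match.group(1)))
    return times


# Two lines copied from the logs in this thread.
SAMPLE = [
    "[12102.097114] pcieport 0004:00:00.0: AER: Corrected error received: 0004:00:00.0",
    "[12233.398851] pcieport 0004:00:00.0: AER: Corrected error received: 0004:00:00.0",
]

for port, stamps in corrected_error_times(SAMPLE).items():
    gaps = [round(b - a, 3) for a, b in zip(stamps, stamps[1:])]
    print(port, "errors:", len(stamps), "gaps (s):", gaps)
```

The same function can be given the lines of a saved `dmesg` capture or of `/var/log/syslog` instead of `SAMPLE`.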

Can you reproduce the same error with this SSD on an NVIDIA devkit?

Those are PCIe corrected errors, so they are not actually causing any functional issue. However, they shouldn’t be happening so frequently: it means you have poor signal integrity between the SSD and the Jetson, and you may end up with uncorrectable errors, which would cause a problem. I would check the connector for any dirt or debris and clean it if necessary. Just inserting and removing the SSD from the connector a few times would help to clean the contacts. If that doesn’t fix the problem, you may have faulty hardware, either the Jetson or the SSD.
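
One way to check whether reseating the SSD changes anything is to compare the kernel's per-device AER counters before and after. The sketch below assumes the running kernel exposes the standard `aer_dev_correctable` sysfs attribute for the root port; that has not been confirmed on JetPack 6 in this thread. The function names are made up for illustration.

```python
from pathlib import Path


def parse_aer_counters(text):
    """Parse 'NAME COUNT' lines, the layout used by aer_dev_correctable."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def read_aer_counters(bdf, root="/sys/bus/pci/devices"):
    """Return the counters for one device, or None if the file is unavailable."""
    path = Path(root) / bdf / "aer_dev_correctable"
    try:
        return parse_aer_counters(path.read_text())
    except OSError:
        return None


# 0004:00:00.0 is the root port named in the logs above.
print(read_aer_counters("0004:00:00.0"))
```

If the attribute is missing, the function returns `None`, and the dmesg or syslog output remains the only source.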

This is an NVIDIA Jetson AGX Orin 64GB Developer Kit, new out of the box, as is the SSD. The connectors are clean.

I had nearly the same problems with a Samsung 990 Pro on a ConnectTech Rogue Orin AGX board. The solution was to limit the M.2 slots to PCIe Gen3 in the device tree.

Yeah, I think it's an issue with the Orin not liking the 990. I bought a second Samsung 990 and got the same issue. I just installed a Micron/Crucial SSD and have seen no errors so far.

│ Current version: P9CR40D / Crucial P3 Plus 2TB PCIe Gen4 3D NAND NVMe M.2 SSD
│ Vendor: Micron/Crucial Technology (NVME:0xC0A9)

We’ve verified there’s no issue with the Samsung 980 PRO 2TB.
Could you help clarify whether the issue is specific to the Samsung 990 2TB?

We don’t have this NVMe SSD to verify.
Please also check whether limiting the link to PCIe Gen3 by configuring max-link-speed in the device tree helps.
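
After changing `max-link-speed`, it helps to confirm what the link actually negotiated. This small Python sketch reads the standard `current_link_speed` and `max_link_speed` sysfs attributes and maps the transfer rate to a PCIe generation. The function names are made up for illustration, and the device address is the root port from the logs in this thread.

```python
from pathlib import Path

# PCIe per-lane transfer rates (GT/s) and the generation each corresponds to.
GEN_BY_RATE = {2.5: 1, 5.0: 2, 8.0: 3, 16.0: 4, 32.0: 5}


def pcie_gen(speed_text):
    """Turn a sysfs string such as '8.0 GT/s PCIe' into a generation number."""
    try:
        return GEN_BY_RATE.get(float(speed_text.split()[0]))
    except (ValueError, IndexError):
        return None


def link_report(bdf, root="/sys/bus/pci/devices"):
    """Return (raw text, generation) for the link speed attributes of a device."""
    report = {}
    for name in ("current_link_speed", "max_link_speed"):
        try:
            text = (Path(root) / bdf / name).read_text().strip()
        except OSError:
            text = ""  # attribute missing or unreadable
        report[name] = (text, pcie_gen(text))
    return report


print(link_report("0004:00:00.0"))
```

If the Gen3 limit took effect, `current_link_speed` would be expected to report 8.0 GT/s rather than 16.0 GT/s.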

Since I swapped to the other SSD OEM, I have not seen the issue. I do have some 980s I could try as well.

NVIDIA did send me another Orin dev kit last week, so I can try the 990 2TB on that one as well.

I tried two different 990s on the first Orin and they both showed the same behaviour immediately.

Hi charlie-wx,

Thanks for your info.
Please let us know your results so we can clarify whether the issue is specific to the Samsung 990 2TB
(i.e., whether the 980 works but the 990 does not).