Unfortunately, far-end tuning alone is not sufficient to give us reliable Gen 2 communication. We would like to run PCIe tuning on the TX1. What tuning parameters are available on the TX1, and how can we access them?
Before analyzing things at the physical layer, can we quickly check a couple of things?
→ What is the endpoint device in this case?
→ What kind of AER errors are we observing (Correctable, Uncorrectable, or both)?
→ Do we have ASPM states (L0s / L1) enabled for the link in this case? (“lspci -vvvv” would tell us this)
→ Can we try disabling ASPM completely (add ‘pcie_aspm=off’ to the kernel command line) and see if we still observe AER errors? (Over time, we have observed that ASPM L0s/L1 is broken on many endpoints, so it is worth trying with ASPM disabled completely.)
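The two ASPM checks above can be scripted. The following is a minimal Python sketch, not an NVIDIA tool. It assumes the usual pciutils text format, where the Link Control line reads `LnkCtl: ASPM <state>;` (the exact wording can differ between pciutils versions), and the helper names `aspm_states` and `aspm_disabled_on_cmdline` are hypothetical.

```python
import re

# Assumed format of the Link Control line in verbose lspci output, e.g.
#   "LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes ..."
#   "LnkCtl: ASPM Disabled; RCB 64 bytes ..."
# "LnkCtl2:" lines are not matched because the colon must follow "LnkCtl".
LNKCTL_ASPM = re.compile(r"LnkCtl:\s*ASPM\s+([^;]+);")


def aspm_states(lspci_text):
    """Return the ASPM setting string of every LnkCtl line in lspci output."""
    return [m.group(1).strip() for m in LNKCTL_ASPM.finditer(lspci_text)]


def aspm_disabled_on_cmdline(cmdline):
    """Return True if the kernel command line contains pcie_aspm=off."""
    return "pcie_aspm=off" in cmdline.split()
```

On the target, `lspci_text` would be the captured output of `lspci -vv` (run as root so the capability fields are shown) and `cmdline` the contents of `/proc/cmdline`.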
In general, if your layout follows our layout rules, manual tuning should not be needed.
So please check the PCIe Interface Signal Routing Requirements in JetsonTX1_OEM_Product_DesignGuide:
Is your total length over spec?
How about trace impedance, reference plane, and series cap placement?
We will provide the trace length report separately. All guidelines and all PCIe rules were followed. Impedance, reference plane, and series cap placement are all within spec.
We have already proven that tuning the PCIe switch dramatically reduces the AER errors; now we need to do the same with the TX1 registers.
Did you read the value of the Root Error Status Register (34.4.6.13 T_PCIE2_RP_ERPTCAP_ERR_STS in the TRM) to find out which kind of AER error you hit?
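Once the register value has been read, it can be decoded along these lines. This is a minimal sketch that assumes T_PCIE2_RP_ERPTCAP_ERR_STS follows the standard PCIe AER Root Error Status bit layout; that assumption should be checked against the TRM section cited above. The function name `decode_root_err_status` is hypothetical, and how the value is read from the TX1 is not shown.

```python
# Bit positions of the standard PCIe AER Root Error Status register.
# ASSUMPTION: the TX1 register uses this same layout; confirm in the TRM.
ROOT_ERR_STATUS_BITS = {
    0: "ERR_COR received",
    1: "Multiple ERR_COR received",
    2: "ERR_FATAL/NONFATAL received",
    3: "Multiple ERR_FATAL/NONFATAL received",
    4: "First uncorrectable error was fatal",
    5: "Non-fatal error messages received",
    6: "Fatal error messages received",
}


def decode_root_err_status(value):
    """Return the names of all status bits set in a 32-bit register value."""
    return [
        name
        for bit, name in sorted(ROOT_ERR_STATUS_BITS.items())
        if value & (1 << bit)
    ]
```

A value with only bits 0 and 1 set would point to correctable errors only, while bits 2 to 6 indicate uncorrectable errors.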
I saw many 0.22uF caps between the AP and the PCIe switch. I am not sure the switch can handle this and isolate the TX1 module side from the AP side, and as you can see in the OEM DG, only 0.1uF caps are mentioned in the PCIe design. Can you try testing the signals without these 0.22uF caps?
I will follow up separately. Can we take this discussion offline via email? I want to share eye plots from before and after SerDes tuning of the switch, which show a massive improvement. I will also follow up on your questions. Since you already have our schematics, this discussion can move quickly. We are time-crunched for production!
I understand no one has requested SerDes tuning before, but for our design it’s mandatory.
Just verified (cannot mention specifics in a public forum). Our carrier board uses 0.1uF AC-coupling capacitors between the switch and the TX1 modules, as verified in the schematic, which meets NVIDIA’s design guide document.
If you are referring to our schematic, the 0.22uF caps are used in the PCIe redriver (from the CPU PCIe ports) → switch path and also in filters for various power rails.
The AER error settings are hard to catch, but we will give it a shot.
→ What is the endpoint device in this case?
YC> The IDT switch ports connected to the TX1 modules are the endpoints.
→ What kind of AER errors are we observing (Correctable, Uncorrectable, or both)?
YC> So far, only Correctable errors are being observed.
→ Do we have ASPM states (L0s / L1) enabled for the link in this case? (“lspci -vvvv” would tell us this)
YC> We confirmed ASPM was enabled.
→ Can we try disabling ASPM completely (add ‘pcie_aspm=off’ to the kernel command line) and see if we still observe AER errors? (Over time, we have observed that ASPM L0s/L1 is broken on many endpoints, so it is worth trying with ASPM disabled completely.)
YC> We’re going to re-run the test with ASPM disabled soon.
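Since only Correctable errors have been reported so far, it may help to break them down by type when the test is re-run. The sketch below decodes a value in the standard PCIe AER Correctable Error Status layout. The function name `decode_correctable_status` is hypothetical, and it is an assumption that the value was read from a register that follows the standard layout.

```python
# Bit positions of the standard PCIe AER Correctable Error Status register.
# ASSUMPTION: the value passed in comes from a register with this layout.
CORRECTABLE_ERR_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def decode_correctable_status(value):
    """Return the names of all correctable-error bits set in the value."""
    return [
        name
        for bit, name in sorted(CORRECTABLE_ERR_BITS.items())
        if value & (1 << bit)
    ]
```

Receiver Error, Bad TLP, and Bad DLLP are the entries most often associated with link-level signal problems, so their counts are the ones worth comparing between the ASPM-enabled and ASPM-disabled runs.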
Good news. The team re-tested with ASPM disabled on 24 TX1 modules over 24 hours. No AER errors were reported. Agree it was strange that AER errors would increase with ASPM disabled. The next step is to test with ASPM enabled and SSC turned on. Will be in touch regarding register tuning.