TX1 PCIe Channel Tuning

We have a TX1 connected over PCIe x4 and are observing PCIe issues (AER errors in general) on the link.

Measured at the Tx1_Rx input side, these are the measurements we have:

Gen 1 mean measurements:
Before far-end tuning:
Eye width 332.44 ps
Eye height 361 mV

After far-end SerDes tuning:
Eye width 351.11 ps
Eye height 435 mV

Gen 2 mean measurements:
Before far-end tuning:
Eye width 110.22 ps
Eye height 137 mV

After far-end SerDes tuning:
Eye width 130.67 ps
Eye height 280 mV
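
As a rough sanity check on the numbers above, the eye widths can be normalized to the unit interval (UI) of each generation: 400 ps at 2.5 GT/s for Gen 1 and 200 ps at 5 GT/s for Gen 2. The sketch below only restates the measurements in this post in UI terms; the dictionary and function names are illustrative, and no pass/fail limit from the specification is implied.

```python
# Unit interval (UI) follows from the line rate: 2.5 GT/s -> 400 ps, 5 GT/s -> 200 ps.
UI_PS = {"Gen 1": 1e12 / 2.5e9, "Gen 2": 1e12 / 5.0e9}

# Mean eye measurements quoted in this post: (width in ps, height in mV).
EYES = {
    ("Gen 1", "before"): (332.44, 361),
    ("Gen 1", "after"): (351.11, 435),
    ("Gen 2", "before"): (110.22, 137),
    ("Gen 2", "after"): (130.67, 280),
}


def eye_width_ui(gen, width_ps):
    """Return the eye width as a fraction of one unit interval."""
    return width_ps / UI_PS[gen]


for (gen, stage), (width_ps, height_mv) in EYES.items():
    print(f"{gen} {stage:6s}: {eye_width_ui(gen, width_ps):.3f} UI, {height_mv} mV")
```

Seen this way, far-end tuning moves the Gen 2 eye from roughly 0.55 UI to 0.65 UI, while the Gen 1 eye moves from roughly 0.83 UI to 0.88 UI.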

Unfortunately, far-end tuning is not sufficient for reliable Gen 2 communication. We would like to run PCIe tuning on the TX1. What tuning parameters are available on the TX1, and how can we access them?

Best,

Before analyzing things at the physical layer, can we quickly check a couple of things?
-> What is the end point device in this case?
-> What kind of AER errors are we observing? (as in Correctable or UnCorrectable or both)?
-> Do we have ASPM states (L0s / L1) enabled for the link in this case? (“lspci -vvvv” would tell us this)
-> Can we try disabling ASPM completely (add ‘pcie_aspm=off’ to the kernel command line) and see if we still observe AER errors? (Over time, we have observed many end points where ASPM L0s/L1 is broken, so it’s worth trying with ASPM disabled completely.)
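
To make the ASPM check above repeatable across many modules, the "LnkCtl" lines of the lspci output can be parsed by a small script. This is a minimal sketch: the sample text is illustrative (not captured from a TX1), the function name is made up for this example, and the exact layout of the LnkCtl line can differ between pciutils versions.

```python
import re

# Matches the ASPM field of a "LnkCtl:" line, e.g. "LnkCtl: ASPM L0s L1 Enabled; ...".
# "LnkCap:" lines also mention ASPM but describe capability, not the current setting.
_LNKCTL_ASPM = re.compile(r"LnkCtl:\s*ASPM\s+([^;]+);")


def aspm_by_device(lspci_text):
    """Map each device address to the ASPM setting on its LnkCtl line."""
    result = {}
    device = None
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            # Device header lines start in column 0 with the bus address.
            device = line.split()[0]
        match = _LNKCTL_ASPM.search(line)
        if match and device is not None:
            result[device] = match.group(1).strip()
    return result


# Illustrative sample only; feed the real output of "lspci -vv" (run as root) instead.
SAMPLE = """\
00:01.0 PCI bridge: Example root port
\t\tLnkCap:\tPort #0, Speed 5GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us
\t\tLnkCtl:\tASPM L0s L1 Enabled; RCB 64 bytes Disabled- Retrain- CommClk+
01:00.0 PCI bridge: Example switch upstream port
\t\tLnkCtl:\tASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
"""

for address, state in aspm_by_device(SAMPLE).items():
    print(address, "->", state)
```

Both ends of the link need to be checked, since ASPM is configured per port.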

In general, if your layout follows our layout rules, manual tuning should not be needed.
So please check the PCIe Interface Signal Routing Requirements in JetsonTX1_OEM_Product_DesignGuide:
Is your total length over spec?
How about trace impedance, reference plane, and series cap placement?

Dear Jim,

We will provide the trace length report separately. All the guidelines and PCIe rules were followed. Impedance, reference plane, and series cap placement are all within spec.

We have already proven that tuning the PCIe switch dramatically reduces the AER errors; now we need to do the same with the TX1 registers.

Are you familiar with our design?

Saran

Hi saransaund,

Did you read the value of the Root Error Status Register (34.4.6.13 T_PCIE2_RP_ERPTCAP_ERR_STS in the TRM) to find out which kind of AER error you are seeing?
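
Once the register value has been read, it can be decoded with a short script. The sketch below uses the bit layout of the AER Root Error Status register from the PCIe specification (the same masks appear as PCI_ERR_ROOT_* in the Linux pci_regs.h header). It is an assumption that T_PCIE2_RP_ERPTCAP_ERR_STS follows this standard layout, so please confirm against the TRM; the function name is made up for this example.

```python
# Low bits of the AER Root Error Status register per the PCIe specification.
# Assumption: the Tegra T_PCIE2_RP_ERPTCAP_ERR_STS register uses the same layout.
ROOT_ERR_STATUS_BITS = {
    0: "ERR_COR received",
    1: "Multiple ERR_COR received",
    2: "ERR_FATAL/NONFATAL received",
    3: "Multiple ERR_FATAL/NONFATAL received",
    4: "First uncorrectable error is fatal",
    5: "Non-fatal error message received",
    6: "Fatal error message received",
}


def decode_root_error_status(value):
    """Return the names of the status bits that are set in a 32-bit register value."""
    return [
        name
        for bit, name in sorted(ROOT_ERR_STATUS_BITS.items())
        if value & (1 << bit)
    ]


# Example: a value of 0x3 means correctable errors were reported more than once.
print(decode_root_error_status(0x3))
```

If only bits 0 and 1 are ever set, the link is reporting correctable errors only.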

I saw many 0.22uF caps between the AP and the PCIe switch. I'm not sure whether the switch can handle this and isolate the TX1 module side from the AP side, and as you can see in the OEM DG, only 0.1uF caps are mentioned in the PCIe design. Can you try testing the signals without these 0.22uF caps?

Dear Truman,

I will follow up separately. Can we take this discussion offline via email? I want to share eye plots from before and after SerDes tuning of the switch, which show a massive improvement. I will also answer your questions. Since you already have our schematics, this discussion can move quickly. We are time-crunched for production!

I understand no one has requested SerDes tuning before, but for our design it's mandatory.

Please connect offline - my email saransaund@gmail.com

Thanks

Saran

Truman,

Just verified (cannot mention specifics in a public forum). Our carrier board uses 0.1uF AC coupling capacitors between the switch and the TX1 modules, as verified in the schematic, which meets NVIDIA’s design guide.

If you are referring to our schematic, the 0.22uF caps are used in the PCIe redriver (from CPU PCIe ports) -> switch path and also in filters for various power rails.

The AER error settings are hard to catch, but we will give it a shot.

Can we take this offline via email?

Thanks

Dear Vidyas,

Here are our responses to your questions:

-> What is the end point device in this case?

YC> The IDT switch ports connected to the TX1 modules are the end points.

-> What kind of AER errors are we observing? (as in Correctable or UnCorrectable or both)?

YC> So far, only Correctable errors are being observed.

-> Do we have ASPM states (L0s / L1) enabled for the link in this case? (“lspci -vvvv” would tell us this)

YC> We confirmed ASPM was enabled.

-> Can we try disabling ASPM completely (add ‘pcie_aspm=off’ to the kernel command line) and see if we still observe AER errors? (Over time, we have observed many end points where ASPM L0s/L1 is broken, so it’s worth trying with ASPM disabled completely.)

YC> We’re going to re-run the test with ASPM disabled soon.

“YC> We’re going to re-run the test with ASPM disabled soon.”
Any update on this?

Dear Vidhyas,

Turning off ASPM made matters worse, spiking the AER errors. Instead, we are now tuning the TX1 registers in separate communications with Truman.

Thanks

Saran

“Turning off ASPM made matters worse, spiking the AER errors”
I’m surprised to hear that. Would you mind sharing the method you used to turn ASPM off?
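
For reference, one way to record how ASPM was configured on a given boot is to capture the kernel command line and the runtime ASPM policy alongside the test results. This is a minimal sketch, assuming a Linux kernel that exposes /proc/cmdline and /sys/module/pcie_aspm/parameters/policy (where the active policy is shown in square brackets); the helper names are made up for this example. It is still worth confirming with lspci that the link itself reports ASPM as disabled.

```python
from pathlib import Path


def cmdline_aspm_setting(cmdline):
    """Return the value of the last pcie_aspm= token on a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        if token.startswith("pcie_aspm="):
            value = token.split("=", 1)[1]
    return value


def active_policy(policy_text):
    """Return the bracketed (active) entry of the pcie_aspm policy file, or None."""
    for word in policy_text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def read_or_none(path):
    """Read a small text file, returning None if it is missing or unreadable."""
    try:
        return Path(path).read_text()
    except OSError:
        return None


if __name__ == "__main__":
    cmdline = read_or_none("/proc/cmdline")
    policy = read_or_none("/sys/module/pcie_aspm/parameters/policy")
    print("pcie_aspm on cmdline:", cmdline_aspm_setting(cmdline) if cmdline else None)
    print("runtime ASPM policy:", active_policy(policy) if policy else None)
```

Logging these two values per module makes it easier to compare a run where AER errors spiked against one where none were reported.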

Absolutely. Give me a couple of days to gather the information from our validation team. Thanks!

Vidhyas.

Good news. The team re-tested with ASPM disabled on 24 TX1 modules over 24 hours, and no AER errors were reported. We agree it was strange that AER errors would increase with ASPM disabled. The next step is to test with ASPM enabled and SSC turned on. We will be in touch regarding register tuning.