Jetson Xavier fails to boot when flashed with a self-compiled kernel

Hi,

We recently got a new Jetson AGX Xavier. We plan to use it as a GEN4 Root Complex to test one of our internal end-devices, which is also GEN4 capable. Initial unboxing and setup of the Jetson AGX Xavier went smoothly. We were able to boot the pre-installed Ubuntu, which runs Linux 4.9.

When we tried to use our PCIe endpoint (GEN4 capable) with the Jetson AGX Xavier as root complex, we faced the issues below:
1: When we connect the PCIe end-device (powered by the host) to the PCIe x16 slot, the host (Jetson AGX Xavier) fails to boot.
2: When we connect the PCIe end-device (externally powered) to the PCIe x16 slot, the Jetson device boots, but linkup happens only at GEN1 and is unstable; lots of AER error logs can be seen on the console of the host (Jetson AGX Xavier).

When we searched this forum for similar problems, we came across the link below:
https://devtalk.nvidia.com/default/topic/1042559/jetson-agx-xavier/no-pcie-link-with-some-devices/

So we decided to try this option by compiling the kernel ourselves.

We downloaded the kernel source from
https://developer.nvidia.com/embedded/linux-tegra (for Jetson AGX Xavier and TX2)

When we followed the steps given in the link below:

First, we faced issues in the compilation itself: it was failing in multiple files (in drivers/net/). We somehow fixed those and compiled the kernel.

When we tried to boot this compiled kernel, the Jetson AGX Xavier failed to boot. Only the initial boot screen (NVIDIA) is displayed, repeatedly.

We request the NVIDIA team to help us with the following:
1: Is there any specific requirement for the end-device in terms of power consumption?
2: The downloaded kernel source does not include a git repository, so we are not sure whether the source code is proper. Is it possible to get a proper kernel source which, when compiled and flashed, can boot the Jetson device?

Is there any suggestion to resolve the PCIe linkup issue or the boot issue?

Thanks,
Pankaj Dubey

Please check this document on customizing the kernel.

https://docs.nvidia.com/jetson/l4t/index.html#page/Tegra%2520Linux%2520Driver%2520Package%2520Development%2520Guide%2Fkernel_custom.html%23

Thanks for the reply. Meanwhile we are not able to compile the kernel and boot AGX Xavier using our compiled kernel. But we are seeing PCIe issues even with the latest kernel, as below:

1: When we connect our PCIe end-device (powered by the host) to the PCIe x16 slot, the host (Jetson AGX Xavier) fails to boot.
2: When we connect our PCIe end-device (this time externally powered) to the PCIe x16 slot, the Jetson device boots, but linkup happens only at GEN1 and is unstable; lots of AER error logs can be seen on the console of the host (Jetson AGX Xavier). We tried disabling AER on the host side to check the link status. What we see is that linkup happens, but soon afterwards the device drops off the bus.

We have verified our end-device (powered by the host or externally powered) on GEN3-capable ARM64-based and x86-based hosts, and it works fine on those machines. We do see linkup and enumeration happen on them.

Since our end-device is GEN4 capable, we wanted to test it on the Jetson AGX Xavier, but with the Xavier host we see linkup issues.
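One way to compare the negotiated link against the device's capability on each host is to read the LnkCap and LnkSta lines from `lspci -vv`. Below is a minimal Python sketch, assuming those lines use the usual `Speed <n>GT/s, Width x<n>` layout (it may vary between pciutils versions). The function names are made up for illustration, and `lspci` generally needs root to show the capability lines.

```python
import re
import subprocess

# Per-lane transfer rate (GT/s) to PCIe generation.
GEN_BY_SPEED = {2.5: 1, 5.0: 2, 8.0: 3, 16.0: 4}

# Matches "LnkCap:" / "LnkSta:" lines only (not "LnkCap2:" / "LnkSta2:").
_LINK_RE = re.compile(r"(LnkCap|LnkSta):.*?Speed\s+([\d.]+)GT/s.*?Width\s+x(\d+)")


def parse_link(lspci_text):
    """Return {'LnkCap': (speed_gts, width), 'LnkSta': (speed_gts, width)}
    from the `lspci -vv` text of a single device. Missing fields are omitted."""
    result = {}
    for line in lspci_text.splitlines():
        m = _LINK_RE.search(line)
        if m:
            field, speed, width = m.groups()
            # Keep the first occurrence of each field.
            result.setdefault(field, (float(speed), int(width)))
    return result


def pcie_gen(speed_gts):
    """Map a transfer rate to a PCIe generation, or None if unknown."""
    return GEN_BY_SPEED.get(speed_gts)


def is_downgraded(link):
    """True if the negotiated speed or width is below the advertised capability."""
    cap, sta = link.get("LnkCap"), link.get("LnkSta")
    if not cap or not sta:
        return False
    return sta[0] < cap[0] or sta[1] < cap[1]


def read_lspci(bdf):
    """Run `lspci -vv -s <bdf>` and return its text (bdf like '0005:01:00.0')."""
    return subprocess.run(
        ["lspci", "-vv", "-s", bdf], capture_output=True, text=True, check=True
    ).stdout
```

For example, `is_downgraded(parse_link(read_lspci(bdf)))` with the endpoint's bus address as `bdf` would report whether the link trained below the capability advertised in LnkCap.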

So we would like to know: is there any specific requirement for the end-device in terms of power consumption or clock requirements?

>> 1: When we connect our PCIe end-device (powered by the host) to the PCIe x16 slot, the host (Jetson AGX Xavier) fails to boot.
Can we have logs for this case? Or do you not get any boot logs at all?
The power that can be drawn from the Jetson AGX PCIe slot is 75W. I’ll get back to you on the exact number soon.
Is your endpoint device drawing more power than this?

>> 2: When we connect our PCIe end-device (this time externally powered) to the PCIe x16 slot, the Jetson device boots, but linkup happens only at GEN1 and is unstable; lots of AER error logs can be seen on the console of the host (Jetson AGX Xavier). We tried disabling AER on the host side to check the link status. What we see is that linkup happens, but soon afterwards the device drops off the bus.
Is the endpoint the same here, except that it now takes power from a different external source instead of the Jetson Xavier slot?
Do you have the CLKREQ signal routed from the endpoint to the host, and does the endpoint pull it low to request REFCLK?
If not, can you please add the “nvidia,disable-clock-request” entry to the PCIe controller node “pcie@141a0000” and check?
Also, you can experiment with setting “nvidia,max-speed = <3>;” in the same node to see if the link is stable at Gen-3 speed (or at Gen-2 or Gen-1 by setting ‘2’ or ‘1’ respectively).
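
Put together, the two suggestions above could look like the device tree fragment below. This is only a sketch: the node name and both properties are the ones named in this post, but which file in your L4T sources defines the node is not stated here, so please check that against your own source tree. It is probably easiest to try one change at a time.

```
/* Sketch only: merge into the existing pcie@141a0000 node and keep its other properties. */
pcie@141a0000 {
	/* Try this if the endpoint does not pull CLKREQ low to request REFCLK. */
	nvidia,disable-clock-request;

	/* Cap the link speed: 3 = Gen-3, 2 = Gen-2, 1 = Gen-1. */
	nvidia,max-speed = <3>;
};
```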