Crashing at boot sometimes

Hi,

OK, sorry, I didn’t notice you are using the rel-28.x release.

Would you mind upgrading to rel-32?

Sorry @WayneWWW, but I am a little new to all this.

How do I do that? And also, why? Is there some issue that is solved in that release?

Hi boltzman,

Then it is my turn to ask: why did you pick the rel-28.x release? Any concern about upgrading to k4.9?
Actually, most of our development and latest features are on rel-32 instead of rel-28.

Also, you should already know how to do the upgrade. It is the same as how you installed the rel-28.x release.
You can download the BSP either from sdkmanager or as a tarball from the download center.

Actually, I didn’t see any error in your previous log. According to your other posts, it looks like you are doing something over PCIe. By coincidence, your device hangs just after PCIe device probing is done.

Such a hang (without any error) is usually a hardware design problem in which the peripheral draws too much current.
Are you able to reproduce the hang if you remove the PCIe device?

Thanks a lot for your time @WayneWWW,

Yes, I am trying to do something over PCIe. Basically, I have this custom board with a switch and some SSDs. After booting the TX2, I can’t get any devices to show up when running “lspci”:

nvidia@tegra-ubuntu:~$ lspci
00:01.0 PCI bridge: NVIDIA Corporation Device 10e5 (rev a1)
nvidia@tegra-ubuntu:~$

The switch and SSDs are powered on. I did peek at the link-status registers in the PCIe switch, and the TX2-to-switch link seems to have trained successfully. This suggests that the devices are running and well connected, but that the power-on sequence might be incorrect. That is why I was trying to reset the TX2 module, hoping it would re-scan the PCIe bus and detect the switch and SSDs.
Both the SSDs and the Switch are using the clock and reset from the TX2 module currently, so I don’t know what could be causing that…
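For anyone wanting to sanity-check link training the same way: the Link Status (LnkSta) register in a device’s PCIe capability encodes the negotiated speed and width. A minimal sketch of decoding such a value (0x1042 is a made-up example value, not one read from this setup):

```shell
# Decode a PCIe Link Status (LnkSta) register value.
# 0x1042 is a hypothetical example, not a value from this thread.
lnksta=0x1042
speed=$(( lnksta & 0xF ))          # bits 3:0 -> current link speed (1 = Gen1, 2 = Gen2)
width=$(( (lnksta >> 4) & 0x3F ))  # bits 9:4 -> negotiated link width
echo "link trained at Gen${speed} x${width}"
```

On a running system the same information appears on the `LnkSta:` line of `sudo lspci -vv` for each device.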

I will look into upgrading the release; I just wanted to know if you had some particular reason for it.
Do you have any idea what could be causing this?
Also, any help on my other threads is much appreciated :) I am trying to do multiple things, this one being the first in the pipeline.

As I said in my previous comment, you could first check whether this issue is caused by the SSDs and the switch.

Such a hang (without any error) is usually a hardware design problem in which the peripheral draws too much current.
Are you able to reproduce the hang if you remove the PCIe device?

Yes, you are right, I forgot that you mentioned this in a previous comment. I would be surprised if that was the problem, since the TX2 module boots correctly during the normal boot sequence (which also turns on the switch/SSDs at some point). All these components also use different power rails… It could be a matter of timing, I guess, but that would be very bad luck I think.

It is tricky to test because all the components are in the same system and I don’t have physical access to it at the moment, but I will see if I can switch off some components and try again. I will post again with my results.
Does the TX2 module have any method to identify a power outage like you describe? Some SoCs I worked with in the past have registers allowing “forensic” investigation after a failure. Anyway, if you know of any, let me know.

Hi @WayneWWW,

I was finally able to do some tests.
Basically, I turned off all the devices using the same power rails as the Jetson TX2; still, the system hung at start-up when using the power button to reboot.
More importantly, I tried to boot using the original device tree from JetPack to see if I could get the PCIe devices enumerated. Still no device came up in lspci… All I see is this, which as far as I am aware is the root port:

00:01.0 PCI bridge: NVIDIA Corporation Device 10e5 (rev a1)

I am attaching the output from boot. In there I can see that nothing is found on port #2 and it is therefore disabled; however, the same is not said for port #0… but I still don’t see any devices. Why is that? If there are no devices on either port, why do I not see the same message for both?
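One way to see what the controller decided per port is to grep the kernel log for the tegra-pcie probe messages (a sketch; the exact wording of the messages can differ between L4T releases):

```shell
# Show per-port link-training results from the tegra-pcie driver
dmesg | grep -iE 'tegra-pcie|link .*(up|down)'
```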

Also… looking at the P2597 B04 schematics, I see 4 lanes are used for PCIe port #0, but in the device tree I see the port configured as a 2-lane port… My custom board also uses 4 lanes, so I thought the original device tree would be good, but I should modify it, right? Here is the PCIe portion of the original device tree:

	pcie-controller@10003000 {
	compatible = "nvidia,tegra186-pcie";
	power-domains = <0xcc>;
	device_type = "pci";
	reg = <0x0 0x10003000 0x0 0x800 0x0 0x10003800 0x0 0x800 0x0 0x40000000 0x0 0x10000000>;
	reg-names = "pads", "afi", "cs";
	clocks = <0xd 0x4 0xd 0x3 0xd 0x261>;
	clock-names = "afi", "pcie", "clk_m";
	resets = <0xd 0x1 0xd 0x1d 0xd 0x1e>;
	reset-names = "afi", "pcie", "pciex";
	interrupts = <0x0 0x48 0x4 0x0 0x49 0x4>;
	interrupt-names = "intr", "msi";
	#interrupt-cells = <0x1>;
	interrupt-map-mask = <0x0 0x0 0x0 0x0>;
	interrupt-map = <0x0 0x0 0x0 0x0 0x1 0x0 0x48 0x4>;
	#stream-id-cells = <0x1>;
	bus-range = <0x0 0xff>;
	#address-cells = <0x3>;
	#size-cells = <0x2>;
	ranges = <0x82000000 0x0 0x10000000 0x0 0x10000000 0x0 0x1000 0x82000000 0x0 0x10001000 0x0 0x10001000 0x0 0x1000 0x82000000 0x0 0x10004000 0x0 0x10004000 0x0 0x1000 0x81000000 0x0 0x0 0x0 0x50000000 0x0 0x10000 0x82000000 0x0 0x50100000 0x0 0x50100000 0x0 0x7f00000 0xc2000000 0x0 0x58000000 0x0 0x58000000 0x0 0x28000000>;
	status = "okay";
	vddio-pexctl-aud-supply = <0xe>;
	linux,phandle = <0x76>;
	phandle = <0x76>;

	pci@1,0 {
		device_type = "pci";
		assigned-addresses = <0x82000800 0x0 0x10000000 0x0 0x1000>;
		reg = <0x800 0x0 0x0 0x0 0x0>;
		status = "okay";
		#address-cells = <0x3>;
		#size-cells = <0x2>;
		ranges;
		nvidia,num-lanes = <0x2>;
		nvidia,afi-ctl-offset = <0x110>;
	};

	pci@2,0 {
		device_type = "pci";
		assigned-addresses = <0x82001000 0x0 0x10001000 0x0 0x1000>;
		reg = <0x1000 0x0 0x0 0x0 0x0>;
		status = "disabled";
		#address-cells = <0x3>;
		#size-cells = <0x2>;
		ranges;
		nvidia,num-lanes = <0x1>;
		nvidia,afi-ctl-offset = <0x118>;
	};

	pci@3,0 {
		device_type = "pci";
		assigned-addresses = <0x82001800 0x0 0x10004000 0x0 0x1000>;
		reg = <0x1800 0x0 0x0 0x0 0x0>;
		status = "okay";
		#address-cells = <0x3>;
		#size-cells = <0x2>;
		ranges;
		nvidia,num-lanes = <0x1>;
		nvidia,afi-ctl-offset = <0x19c>;
	};

	prod-settings {
		#prod-cells = <0x3>;

		prod_c_pad {
			prod = <0xc8 0xffffffff 0x80b880b8 0xcc 0xffffffff 0x480b8>;
		};
	};
};
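Since the question above is about switching pci@1,0 from x2 to x4, here is a sketch of one way to decompile, edit, and recompile the dtb (assumes the `dtc` tool from the device-tree-compiler package; output file names are placeholders):

```shell
# Decompile the stock dtb to source form
dtc -I dtb -O dts -o quill.dts tegra186-quill-p3310-1000-a00-00-base.dtb
# Change pci@1,0 from x2 to x4 (or edit quill.dts by hand)
sed -i '/pci@1,0 {/,/};/ s/nvidia,num-lanes = <0x2>/nvidia,num-lanes = <0x4>/' quill.dts
# Recompile; flash the result in place of the original dtb
dtc -I dts -O dtb -o quill-x4.dtb quill.dts
```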

received.log (118.7 KB)


Hi,

I think we should focus on the boot problem first. Obviously, this log is not from a boot-up failure case, is it?

As for your PCIe, the default config JetPack uses is 4,0,1, meaning x4/x0/x1 lanes. I am not sure why it becomes 2,1,1 there. Are you sure this is the default dtb?

Also, I noticed something that may not be fatal here, but you may still need to be careful about.

nvidia@tegra-ubuntu:~$ [ 22.848377] tegradc 15210000.nvdisplay: sanitize_flip_args: WIN 3 invalid size:w=0,h=0,out_w=0,out_h=0

Such an error is due to something wrong with the RAM_CODE pin.

Please refer to Table 85, Power-on Strapping Breakdown, in the TX2 OEM Product Design Guide.

If UART1_TX or UART0_RTS are used in a design, they must not be driven or pulled high or low during power-on.
Violating this requirement can change the RAM_CODE strapping & result in functional failures

Hi @WayneWWW,
I installed JetPack 3.2.1 and got the device tree from here: 64_TX2/Linux_for_Tegra/kernel/dtb/tegra186-quill-p3310-1000-a00-00-base.dtb
After decompiling, I got the PCI configuration shown above in my last post.
Should I be looking at another one?

Hi boltzman,

  1. Please note that there are always differences between your carrier board and the devkit. It is possible to have problems if you just use the devkit dtb directly.

tegra186-quill-p3310-1000-a00-00-base.dtb

  2. Not to mention, you are using a TX2 dtb on a TX2i module…
    TX2 → p3310
    TX2i → p3489

You are right, I will test with the TX2i right away.
From what I have been reading the fact I can actually see one line in “lspci” (the main bridge) probably means the stuff below it is also getting detected… i am just not sure why it wont show up in lspci.

If nothing comes up in lspci, it is probably a hardware design problem. But you should try to enable it as x4 lane first.

So I went through the schematics again and I saw one difference which might be important. On our board, we are currently pulling PEXn_CLKREQ high. As far as I understand, that signal is active low, and must be pulled low if the clock from PEXn_REFCLK+/- is to be used.
Is that correct?

I had asked whether devices in the PCIe tree can be fed a different clock signal than the one from the TX2i module: External PCIe clock reference - #3 by vidyas
I never got a clear answer to that.

Yes. PEXn_CLKREQ is an active-low signal and it affects PEXn_REFCLK.
But, to avoid REFCLK’s dependency on the CLKREQ signal, the PCIe root port nodes (rather, sub-nodes, i.e. pci@1,0 / pci@2,0 / pci@3,0) in the device tree can have an “nvidia,disable-clock-request;” entry.

Tegra PCIe controller doesn’t support endpoints receiving clock from a different source than the root port itself.
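For reference, the entry goes inside each root-port sub-node of the tree posted earlier, e.g. (a sketch based on the pci@1,0 node above, other properties omitted):

```
pci@1,0 {
	/* existing properties unchanged */
	nvidia,disable-clock-request;
};
```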

Thanks @vidyas .
Just so I know, where could I have found this information? I don’t see it mentioned in any of the documents I use for reference… :|

So I made some progress, I think…
I edited the device tree: I disabled all PCIe sub-nodes except pci@1,0, which I configured as x4.
I added the “nvidia,disable-clock-request;” entry to all PCIe sub-nodes.

Now when I boot my system, as before, I can see the module’s PCIe root port, but nothing below it:

nvidia@tegra-ubuntu:~$ lspci
00:01.0 PCI bridge: NVIDIA Corporation Device 10e5 (rev a1)

However, now I can actually reboot, so “sudo reboot” does not produce a hang while booting as it did before.
After rebooting, though, lspci does not give any result at all, not even the bridge that was displayed before… This same device tree works fine on my devkit. Our board is really not that different from the devkit… we have a clock multiplexer to switch reference clocks and some other stuff, but I have checked and they all seem to be configured right.

Any ideas on what could be causing that?
I am attaching:

  1. The log from the fresh boot. fresh_boot.log (100.9 KB)
  2. The detailed output of lspci before rebooting. lspci.log (4.2 KB)
  3. The log from the reboot reboot.log (104.7 KB)

Is your setup providing REFCLK to the switch and endpoints from a source other than the Jetson?
After a fresh boot, could you please try executing the following command and share your observations?
echo 1 > /sys/bus/pci/devices/0000:00:01.0/rescan
Can you please attach the PCIe switch and its downstream devices to another host (x86, for example) and share the output of ‘sudo lspci -vv’?

Hello,

So I think I figured it out. Basically, our PCIe switch configuration was wrong: we were configuring the port to which the Jetson is connected as a normal downstream port, not as the upstream port. Apparently, booting the Jetson while connected to a downstream port causes it to hang (it would be nice to have some more info in the log, I guess).
I definitely do not see any power drop caused by a surge in power consumption. I am powering the whole setup with a PSU on my desk and the current limit is never reached.

@vidyas To answer your question, I would like to feed an external clock to the Jetson; that would greatly simplify my architecture and perhaps allow me to access the SSDs from another root (using an NT port) simultaneously. I will probably connect the Jetson directly and use the NT port to connect the other host; it is easier that way I think, but I might be back in the future with more questions about that.

I’m afraid Tegra doesn’t support an independent REFCLK configuration, i.e. the PCIe hierarchy connected to Tegra must also take its REFCLK from Tegra.