Jetson TX2 (4GB) crashes after SError and Machine check error

cbaumann · May 11, 2023, 12:29pm

Hi Nvidia support,

We are running Jetson R32.7.2 on TX2 and TX2 4GB modules with a custom carrier board. We are stuck on an issue very similar to the one in this thread. A couple of Jetson modules in the field (both TX2 and TX2 4GB) crash with errors like the following (log from a TX2 4GB):

[141528.900489] CPU3: SError detected, daif=1c0, spsr=0x800000c5, mpidr=80000101, esr=bf40c000
[141528.900493] CPU5: SError detected, daif=1c0, spsr=0x800000c5, mpidr=80000103, esr=bf40c000
[141528.900495] CPU4: SError detected, daif=1c0, spsr=0x800000c5, mpidr=80000102, esr=bf40c000
[141528.900523] CPU2: SError detected, daif=1c0, spsr=0x80000000, mpidr=80000001, esr=be000000
[141528.900524] **************************************
[141528.900528] Machine check error in JSR:MTS:
[141528.900530] 	Status = 0xb400000000000001
[141528.900533] 	Unknown error: 0x1
[141528.900538] 	Uncorrected (this is fatal)
[141528.900539] 	Error reporting enabled when error arrived
[141528.900562] 	ADDR = 0x17e99c280
[141528.900564] **************************************
[141528.900571] CPU0: SError detected, daif=1c0, spsr=0x60000045, mpidr=80000100, esr=bf40c000
[141528.900727] **************************************
[141528.900729] Machine check error in JSR:MTS:
[141528.900732] 	Status = 0xb400000000000001
[141528.900733] 	Unknown error: 0x1
[141528.900735] 	Uncorrected (this is fatal)
[141528.900736] 	Error reporting enabled when error arrived
[141528.900760] 	ADDR = 0x17e99c280
[141528.900761] **************************************
[141528.900919] **************************************
[141528.900921] Machine check error in JSR:MTS:
[141528.900923] 	Status = 0xb400000000000001
[141528.900924] 	Unknown error: 0x1
[141528.900926] 	Uncorrected (this is fatal)
[141528.900927] 	Error reporting enabled when error arrived
[141528.900950] 	ADDR = 0x17e99c280
[141528.900951] **************************************
[141528.901212] **************************************
[141528.901214] Machine check error in JSR:MTS:
[141528.901216] 	Status = 0xb400000000000001
[141528.901217] 	Unknown error: 0x1
[141528.901219] 	Uncorrected (this is fatal)
[141528.901220] 	Error reporting enabled when error arrived
[141528.901243] 	ADDR = 0x17e99c280
[141528.901244] **************************************
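
For anyone who needs to scan saved kernel logs from many units for this signature, here is a minimal sketch. It assumes the log lines look exactly like the ones above; the function names are made up for illustration. It extracts each "SError detected" line and splits the `esr` value into the architectural ESR_ELx fields (EC in bits [31:26], IL in bit [25], ISS in bits [24:0]). An EC of 0x2F is the Arm code for an SError interrupt. The ISS is left undecoded, since its meaning here may be implementation specific.

```python
import re

# Matches lines such as:
# [141528.900489] CPU3: SError detected, daif=1c0, spsr=0x800000c5, mpidr=80000101, esr=bf40c000
SERROR_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+CPU(?P<cpu>\d+): SError detected,.*?"
    r"mpidr=(?P<mpidr>[0-9a-fA-F]+), esr=(?P<esr>[0-9a-fA-F]+)"
)

EC_SERROR = 0x2F  # ESR_ELx exception class for an SError interrupt


def decode_esr(esr):
    """Split an ESR_ELx value into its architectural fields."""
    return {
        "ec": (esr >> 26) & 0x3F,  # exception class, bits [31:26]
        "il": (esr >> 25) & 0x1,   # instruction length bit, bit [25]
        "iss": esr & 0x1FFFFFF,    # syndrome bits [24:0], left raw
    }


def parse_serrors(log_text):
    """Return one dict per 'SError detected' line found in log_text."""
    events = []
    for m in SERROR_RE.finditer(log_text):
        event = {
            "ts": float(m.group("ts")),
            "cpu": int(m.group("cpu")),
            "mpidr": int(m.group("mpidr"), 16),
        }
        event.update(decode_esr(int(m.group("esr"), 16)))
        events.append(event)
    return events


def count_machine_checks(log_text):
    """Count 'Machine check error' report headers in log_text."""
    return log_text.count("Machine check error in")
```

Running `parse_serrors` over the excerpt above should give five events, all with `ec == EC_SERROR`, which is one way to flag affected units from collected logs.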

With our configuration, this triggers a kernel panic that reboots the module, after which everything is fine again. It happens very rarely (sometimes not for a month) and at random times. We have seen it on both TX2 and TX2 4GB units. So far, despite trying hard, we have not been able to reproduce it in a lab environment.

Do you think these are HW issues similar to those in this thread?

If yes, how can we find out whether a module is affected (we have >200 remotely accessible modules in the field)?
If not, can you suggest things to try on the SW side to get more information or to avoid the problem?

Thank you very much in advance!

WayneWWW · May 12, 2023, 3:44am

Could you dump the full boot up log so that I can compare the MTS version in use on the board?

Also, are all CPU cores enabled, or are 2 cores still disabled in your use case?

cbaumann · May 12, 2023, 5:24am

Here is the boot log up to the kernel:
boot_log.txt (64.4 KB)

The two Denver cores are still isolated in our application, but we then assign dedicated tasks to them using /sys/fs/cgroup/cpuset/.
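
To illustrate what this kind of assignment typically looks like, here is a minimal sketch that assumes a cgroup-v1 cpuset hierarchy with the usual `cpuset.cpus`, `cpuset.mems` and `tasks` files. The group name `denver`, the helper name and the PID are hypothetical. CPUs 1-2 are used because those are the cores named in the `isolcpus=1-2` setting discussed in this thread. It is not the poster's actual setup.

```python
import os


def assign_to_cpuset(pid, group, cpus, mems="0", root="/sys/fs/cgroup/cpuset"):
    """Create (or reuse) a cgroup-v1 cpuset and move one task into it.

    `group` is a hypothetical cpuset name; `root` is a parameter so the
    function can be exercised against a scratch directory.
    """
    path = os.path.join(root, group)
    os.makedirs(path, exist_ok=True)
    # cpuset.cpus and cpuset.mems must both be set before a task can attach.
    for name, value in (("cpuset.cpus", cpus), ("cpuset.mems", mems)):
        with open(os.path.join(path, name), "w") as f:
            f.write(value)
    # Writing a PID to `tasks` moves that single thread into the cpuset.
    with open(os.path.join(path, "tasks"), "w") as f:
        f.write(str(pid))
    return path
```

On a real system this needs root, for example `assign_to_cpuset(os.getpid(), "denver", "1-2")`. Leaving the group empty would correspond to the "no task on the Denver cores" experiment.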

WayneWWW · May 12, 2023, 5:25am

Could you try not assigning any tasks to the Denver cores and see whether the issue still reproduces?

WayneWWW · May 12, 2023, 5:26am

Also, I need the full log of your board. Your log does not start from the beginning.

cbaumann · May 12, 2023, 5:40am

I don’t think we can spare 2 cores in our application. We could maybe try not to isolate them. Do you think that could help as well?

The log does start from the beginning: that’s all that is output on the serial port, and the first message is “I2C command failed”. We are not using U-Boot; we boot into the kernel directly from cboot.

DaneLLL · May 12, 2023, 5:48am

Hi,
By default we have isolcpus=1-2 in extlinux.conf. Please try with the default extlinux.conf. We would like to clarify whether the error happens with the default settings.

cbaumann · May 12, 2023, 5:59am

Yes, we are also using the isolcpus=1-2 kernel parameter, as you can see in this line of the boot log:

[0003.352] I> Linux Cmdline: console=ttyS0,115200 androidboot.presilicon=true firmware_class.path=/etc/firmware root=/dev/mapper/crypt_root rw rootwait rootfstype=ext4 console=ttyS0,115200n8 isolcpus=1-2 video=tegrafb earlycon=uart8250,mmio32,0x3100000 nvdumper_reserved=0x1772e0000 gpt rootfs.slot_suffix= usbcore.old_scheme_first=1 tegraid=18.1.2.0.0 maxcpus=6 no_console_suspend boot.slot_suffix= boot.ratchetvalues=0.2031647.1 vpr_resize bl_prof_dataptr=0x10000@0x175840000 sdhci_tegra.en_boot_part_access=1

So this is the default setting. However, we are not using U-Boot and therefore have no extlinux.conf.
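
When there is no extlinux.conf to inspect, the effective setting can be read from the kernel command line on a running unit. Here is a minimal sketch with hypothetical helper names. It assumes the plain CPU-list form of `isolcpus=` as in the line above and does not handle the flag prefixes accepted by newer kernels.

```python
def expand_cpu_list(spec):
    """Expand a CPU list such as "1-2,5" into a sorted list of ints."""
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        elif part:
            cpus.add(int(part))
    return sorted(cpus)


def isolated_cpus(cmdline):
    """Return the CPUs named by isolcpus= in a kernel command line."""
    for token in cmdline.split():
        if token.startswith("isolcpus="):
            return expand_cpu_list(token.split("=", 1)[1])
    return []  # parameter absent: no CPUs isolated this way
```

On a device, `isolated_cpus(open("/proc/cmdline").read())` reports the cores isolated at boot.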

WayneWWW · May 12, 2023, 6:00am

Hi,

Actually, we are referring to this comment of yours.

We want no tasks running on your Denver cores.

cbaumann · May 12, 2023, 6:07am

Ah, I see. We cannot do this on a large scale, as we cannot spare the performance of two cores.

We may be able to do this in a small test environment with at most 8 devices. Because this happens so rarely (and apparently not on all devices), it would take several months to demonstrate the absence of the issue in such an environment.
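
As a rough back-of-the-envelope check on that estimate, here is a minimal sketch. It assumes crashes are independent and arrive at a constant rate (a Poisson model). The rate used in the usage note is purely hypothetical and is not a measured figure from our fleet.

```python
import math


def months_to_rule_out(rate_per_device_month, devices, confidence=0.95):
    """Crash-free test duration (months) needed to rule out a given rate.

    Under a Poisson model, the chance of seeing zero crashes in T months
    on `devices` units is exp(-rate * devices * T). This solves for the T
    at which that chance drops to 1 - confidence.
    """
    return -math.log(1.0 - confidence) / (rate_per_device_month * devices)
```

With a hypothetical rate of 0.1 crashes per device per month, `months_to_rule_out(0.1, 8)` comes out at roughly 3.7 months of crash-free running on 8 devices. A lower true rate would stretch this further.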

system · June 21, 2023, 1:22am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
