Jetson TX2 (4GB) crashes after SError and Machine check error

Hi Nvidia support,

We are running Jetson R32.7.2 on TX2 and TX2 4GB modules with a custom carrier board. We are stuck on an issue very similar to the one in this thread. A few Jetson modules in the field (both TX2 and TX2 4GB) crash with errors like the following (log from a TX2 4GB):

[141528.900489] CPU3: SError detected, daif=1c0, spsr=0x800000c5, mpidr=80000101, esr=bf40c000
[141528.900493] CPU5: SError detected, daif=1c0, spsr=0x800000c5, mpidr=80000103, esr=bf40c000
[141528.900495] CPU4: SError detected, daif=1c0, spsr=0x800000c5, mpidr=80000102, esr=bf40c000
[141528.900523] CPU2: SError detected, daif=1c0, spsr=0x80000000, mpidr=80000001, esr=be000000
[141528.900524] **************************************
[141528.900528] Machine check error in JSR:MTS:
[141528.900530] 	Status = 0xb400000000000001
[141528.900533] 	Unknown error: 0x1
[141528.900538] 	Uncorrected (this is fatal)
[141528.900539] 	Error reporting enabled when error arrived
[141528.900562] 	ADDR = 0x17e99c280
[141528.900564] **************************************
[141528.900571] CPU0: SError detected, daif=1c0, spsr=0x60000045, mpidr=80000100, esr=bf40c000
[141528.900727] **************************************
[141528.900729] Machine check error in JSR:MTS:
[141528.900732] 	Status = 0xb400000000000001
[141528.900733] 	Unknown error: 0x1
[141528.900735] 	Uncorrected (this is fatal)
[141528.900736] 	Error reporting enabled when error arrived
[141528.900760] 	ADDR = 0x17e99c280
[141528.900761] **************************************
[141528.900919] **************************************
[141528.900921] Machine check error in JSR:MTS:
[141528.900923] 	Status = 0xb400000000000001
[141528.900924] 	Unknown error: 0x1
[141528.900926] 	Uncorrected (this is fatal)
[141528.900927] 	Error reporting enabled when error arrived
[141528.900950] 	ADDR = 0x17e99c280
[141528.900951] **************************************
[141528.901212] **************************************
[141528.901214] Machine check error in JSR:MTS:
[141528.901216] 	Status = 0xb400000000000001
[141528.901217] 	Unknown error: 0x1
[141528.901219] 	Uncorrected (this is fatal)
[141528.901220] 	Error reporting enabled when error arrived
[141528.901243] 	ADDR = 0x17e99c280
[141528.901244] **************************************
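
For anyone reading the SError lines above, here is a minimal Python sketch that splits the `mpidr` and `esr` values into their architectural fields. It only assumes the ARMv8 register layout (MPIDR Aff0 in bits [7:0] and Aff1 in bits [15:8]; ESR exception class in bits [31:26] and syndrome in bits [24:0]). The function name `decode_serror` and the regular expression are our own, not part of any NVIDIA tool.

```python
import re

# Matches kernel lines such as:
# [141528.900489] CPU3: SError detected, daif=1c0, spsr=0x800000c5, mpidr=80000101, esr=bf40c000
SERROR_RE = re.compile(
    r"CPU(?P<cpu>\d+): SError detected, daif=(?P<daif>[0-9a-f]+), "
    r"spsr=0x(?P<spsr>[0-9a-f]+), mpidr=(?P<mpidr>[0-9a-f]+), "
    r"esr=(?P<esr>[0-9a-f]+)"
)


def decode_serror(line):
    """Return decoded fields as a dict, or None if the line is not an SError report."""
    m = SERROR_RE.search(line)
    if m is None:
        return None
    mpidr = int(m.group("mpidr"), 16)
    esr = int(m.group("esr"), 16)
    return {
        "cpu": int(m.group("cpu")),      # Linux logical CPU number
        "aff0": mpidr & 0xFF,            # core index within the cluster
        "aff1": (mpidr >> 8) & 0xFF,     # cluster index
        "esr_ec": (esr >> 26) & 0x3F,    # exception class
        "esr_iss": esr & 0x1FFFFFF,      # instruction-specific syndrome
    }


if __name__ == "__main__":
    sample = (
        "[141528.900523] CPU2: SError detected, daif=1c0, "
        "spsr=0x80000000, mpidr=80000001, esr=be000000"
    )
    print(decode_serror(sample))
```

All five SError lines in the log decode to exception class 0x2f (SError interrupt). The CPU2 line is the only one with Aff1 = 0 and a zero syndrome, and CPU2 is one of the cores covered by isolcpus=1-2.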

With our configuration, this triggers a kernel panic that reboots the module, after which everything is fine again. It happens very rarely (sometimes not for a month) and at random times. We have seen it on both TX2 and TX2 4GB units. So far, despite trying hard, we have not been able to reproduce it in a lab environment.

Do you think these are HW issues similar to the ones in this thread?

If yes, how can we find out whether a module is affected (we have >200 remotely accessible modules in the field)?
If not, can you suggest things to try on the SW side to get more information or to avoid the problem?

Thank you very much in advance!

Could you dump the full boot up log so that I can compare the MTS version in use on the board?

Also, are all the CPU cores enabled in your use case, or are 2 cores still disabled?

Here is the boot log up to the kernel:
boot_log.txt (64.4 KB)

The two Denver cores are still isolated in our application, but we then assign dedicated tasks to them using /sys/fs/cgroup/cpuset/.
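
To make the setup concrete, here is a rough Python sketch of this kind of assignment through the cgroup v1 cpuset interface. It is not our production code: the function name `assign_to_cpuset`, the group name, and the memory node "0" are assumptions for illustration. The only interface it relies on is the standard cgroup v1 file set (`cpuset.cpus`, `cpuset.mems`, `tasks`).

```python
import os


def assign_to_cpuset(pid, cpus, name, root="/sys/fs/cgroup/cpuset", mems="0"):
    """Create a cgroup v1 cpuset (if needed) and move one task into it.

    pid  -- thread ID to move (the v1 "tasks" file takes one thread at a time)
    cpus -- CPU list in kernel syntax, e.g. "1-2"
    name -- hypothetical group name, e.g. "denver"
    """
    group = os.path.join(root, name)
    os.makedirs(group, exist_ok=True)
    # cpuset.cpus and cpuset.mems must both be populated before a task
    # can be attached, so the write order matters.
    for filename, value in (
        ("cpuset.cpus", cpus),
        ("cpuset.mems", mems),
        ("tasks", str(pid)),
    ):
        with open(os.path.join(group, filename), "w") as f:
            f.write(value)
    return group


if __name__ == "__main__":
    # Needs root and a mounted cgroup v1 cpuset hierarchy.
    assign_to_cpuset(os.getpid(), "1-2", "denver")
```

The `root` parameter exists only so the function can be pointed at a scratch directory for a dry run. Writing to `cgroup.procs` instead of `tasks` would move a whole process rather than a single thread.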

Could you try not assigning any tasks to the Denver cores and see whether the issue still reproduces?

Also, I need the full log of your board. Your log does not start from the beginning.

I don’t think we can spare 2 cores in our application. We could maybe try not to isolate them. Do you think that could help as well?

The log does start from the beginning: that’s all that is output on the serial port, and the first message is “I2C command failed”. We are not using u-boot; we boot into the kernel directly from cboot.

Hi,
By default we have isolcpus=1-2 in extlinux.conf. Please try with the default extlinux.conf. We would like to clarify whether the error happens with the default setting.

Yes, we are also using the isolcpus=1-2 kernel parameter, as you can see in this line of the boot log:

[0003.352] I> Linux Cmdline: console=ttyS0,115200 androidboot.presilicon=true firmware_class.path=/etc/firmware root=/dev/mapper/crypt_root rw rootwait rootfstype=ext4 console=ttyS0,115200n8 isolcpus=1-2 video=tegrafb earlycon=uart8250,mmio32,0x3100000 nvdumper_reserved=0x1772e0000 gpt rootfs.slot_suffix= usbcore.old_scheme_first=1 tegraid=18.1.2.0.0 maxcpus=6 no_console_suspend boot.slot_suffix= boot.ratchetvalues=0.2031647.1 vpr_resize bl_prof_dataptr=0x10000@0x175840000 sdhci_tegra.en_boot_part_access=1

So this is the default setting. However, we are not using u-boot and therefore have no extlinux.conf.
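
Since the setting comes from the kernel command line rather than extlinux.conf, it can also be checked on a running module. Below is a small Python sketch that extracts the isolated CPUs from a command line string. The helper names `parse_cpu_list` and `isolated_cpus` are made up for this example, and it assumes the plain CPU-list form of `isolcpus=` as used above (no flag prefixes).

```python
def parse_cpu_list(spec):
    """Expand a kernel CPU list such as "1-2" or "0,3-5" into sorted ints."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def isolated_cpus(cmdline):
    """Return the CPUs named by the last isolcpus= parameter, or [] if absent."""
    result = []
    for token in cmdline.split():
        if token.startswith("isolcpus="):
            result = parse_cpu_list(token[len("isolcpus="):])
    return result


if __name__ == "__main__":
    # /proc/cmdline holds the command line the running kernel was booted with.
    with open("/proc/cmdline") as f:
        print(isolated_cpus(f.read()))
```

Run against the command line quoted above, `isolated_cpus` returns `[1, 2]`.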

Hi,

Actually, we are referring to this comment of yours.

We want no tasks running on your Denver cores.

Ah, I see. We cannot do this at large scale, as we cannot spare the performance of two cores.

We may be able to do this in a small test environment with at most 8 devices. As this happens so rarely (and apparently not on all devices), it would take several months to show the absence of the issue in such an environment.
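
As a back-of-the-envelope illustration of the "several months" estimate, here is a Python sketch that treats crashes as a Poisson process. The crash rate is a purely hypothetical input; we have not measured one, and the function name `observation_months` is ours. With rate r per device-month and n devices, the chance of seeing no crash in T months is exp(-r·n·T), so a crash-free run must last T = -ln(1 - confidence) / (r·n) to be convincing.

```python
import math


def observation_months(rate_per_device_month, devices, confidence=0.95):
    """Months of crash-free running needed before a fix looks effective.

    Assumes crashes follow a Poisson process with the given (hypothetical)
    per-device rate and that every device is equally affected.
    """
    if rate_per_device_month <= 0 or devices <= 0:
        raise ValueError("rate and device count must be positive")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")
    return -math.log(1.0 - confidence) / (rate_per_device_month * devices)


if __name__ == "__main__":
    # Hypothetical rate: one crash per device every ten months.
    print(round(observation_months(0.1, 8), 1))
```

With the made-up rate of 0.1 crashes per device-month, 8 devices need roughly 3.7 crash-free months for 95% confidence. If only some of the devices are affected, the required time grows accordingly.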
