Hi Nvidia support,
We are running Jetson R32.7.2 on TX2 and TX2 4GB modules with a custom carrier board, and we are stuck on an issue very similar to the one in this thread. A couple of Jetson modules in the field (both TX2 and TX2 4GB) crash with errors like the following (log from a TX2 4GB):
[141528.900489] CPU3: SError detected, daif=1c0, spsr=0x800000c5, mpidr=80000101, esr=bf40c000
[141528.900493] CPU5: SError detected, daif=1c0, spsr=0x800000c5, mpidr=80000103, esr=bf40c000
[141528.900495] CPU4: SError detected, daif=1c0, spsr=0x800000c5, mpidr=80000102, esr=bf40c000
[141528.900523] CPU2: SError detected, daif=1c0, spsr=0x80000000, mpidr=80000001, esr=be000000
[141528.900524] **************************************
[141528.900528] Machine check error in JSR:MTS:
[141528.900530] Status = 0xb400000000000001
[141528.900533] Unknown error: 0x1
[141528.900538] Uncorrected (this is fatal)
[141528.900539] Error reporting enabled when error arrived
[141528.900562] ADDR = 0x17e99c280
[141528.900564] **************************************
[141528.900571] CPU0: SError detected, daif=1c0, spsr=0x60000045, mpidr=80000100, esr=bf40c000
[141528.900727] **************************************
[141528.900729] Machine check error in JSR:MTS:
[141528.900732] Status = 0xb400000000000001
[141528.900733] Unknown error: 0x1
[141528.900735] Uncorrected (this is fatal)
[141528.900736] Error reporting enabled when error arrived
[141528.900760] ADDR = 0x17e99c280
[141528.900761] **************************************
[141528.900919] **************************************
[141528.900921] Machine check error in JSR:MTS:
[141528.900923] Status = 0xb400000000000001
[141528.900924] Unknown error: 0x1
[141528.900926] Uncorrected (this is fatal)
[141528.900927] Error reporting enabled when error arrived
[141528.900950] ADDR = 0x17e99c280
[141528.900951] **************************************
[141528.901212] **************************************
[141528.901214] Machine check error in JSR:MTS:
[141528.901216] Status = 0xb400000000000001
[141528.901217] Unknown error: 0x1
[141528.901219] Uncorrected (this is fatal)
[141528.901220] Error reporting enabled when error arrived
[141528.901243] ADDR = 0x17e99c280
[141528.901244] **************************************
With our configuration, this triggers a kernel panic that reboots the module, after which everything is fine again. It happens very rarely (sometimes a month passes without an occurrence) and at random times, on both TX2 and TX2 4GB units. So far, despite trying hard, we have not been able to reproduce it in a lab environment.
Do you think these are HW issues similar to the ones in this thread?
If yes, how can we find out whether a module is affected (we have >200 remotely accessible modules in the field)?
If not, can you suggest things to try on the SW side to get more information or to avoid the problem?
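To compare occurrences across the fleet, we could summarize whatever kernel log text survives a crash. The following is only a rough sketch of that idea: the function name `summarize_kernel_log` and the regular expressions are our own assumptions, written against the exact message format shown in the log above, and it assumes the log is already available as text (how it is captured before the reboot is a separate question).

```python
import re
from collections import Counter

# Patterns matched against the message format in the log above (assumed stable).
SERROR_RE = re.compile(r"CPU(\d+): SError detected.*\besr=([0-9a-fA-F]+)")
MCE_RE = re.compile(r"Machine check error in (.+?):\s*$")
STATUS_RE = re.compile(r"Status = (0x[0-9a-fA-F]+)")
ADDR_RE = re.compile(r"ADDR = (0x[0-9a-fA-F]+)")


def summarize_kernel_log(text):
    """Collect SError and machine check details from kernel log text."""
    summary = {
        "serror_cpus": set(),      # CPU numbers that reported an SError
        "esr_values": Counter(),   # esr value -> number of occurrences
        "mce_sources": Counter(),  # machine check source -> occurrences
        "statuses": set(),         # distinct Status values
        "addresses": set(),        # distinct ADDR values
    }
    for line in text.splitlines():
        match = SERROR_RE.search(line)
        if match:
            summary["serror_cpus"].add(int(match.group(1)))
            summary["esr_values"][match.group(2).lower()] += 1
            continue
        match = MCE_RE.search(line)
        if match:
            summary["mce_sources"][match.group(1)] += 1
            continue
        match = STATUS_RE.search(line)
        if match:
            summary["statuses"].add(match.group(1).lower())
            continue
        match = ADDR_RE.search(line)
        if match:
            summary["addresses"].add(match.group(1).lower())
    return summary
```

Running this over the logs from each affected module would show, for example, whether the reported `ADDR` and `Status` values are the same on every unit or differ per module.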
Thank you very much in advance!