Handle_fhi_core: Scanning Core Error Records for Correctable Errors

Hi NV team,

On JP4.5, a Xavier device encountered CPU-related errors:

Jul  3 04:07:58 wh5-105 kernel: [ 3458.590804] handle_fhi_core: Scanning Core Error Records for Correctable Errors
Jul  3 04:07:58 wh5-105 kernel: [ 3458.590809] CPU4: RAS: FHI 479 detected
Jul  3 04:07:58 wh5-105 kernel: [ 3458.590811] CPU5: RAS: FHI 480 detected
Jul  3 04:07:58 wh5-105 kernel: [ 3458.590820] CPU2: RAS: FHI 477 detected
Jul  3 04:07:58 wh5-105 kernel: [ 3458.590823] CPU3: RAS: FHI 478 detected
Jul  3 04:07:58 wh5-105 kernel: [ 3458.590896] handle_fhi_corecluster:Scanning CoreCluster Error Records for Correctable Errors
Jul  3 04:07:58 wh5-105 kernel: [ 3458.590924] **************************************
Jul  3 04:07:58 wh5-105 kernel: [ 3458.590925] RAS Error in L2, ERRSELR_EL1=544:
Jul  3 04:07:58 wh5-105 kernel: [ 3458.590931]  Status = 0xc5005006
Jul  3 04:07:58 wh5-105 kernel: [ 3458.590933]  IERR = L2 MLC Correctable Error: 0x50
Jul  3 04:07:58 wh5-105 kernel: [ 3458.590935]  SERR = Data value from associative memory: 0x6
Jul  3 04:07:58 wh5-105 kernel: [ 3458.590936]  Correctable Error
Jul  3 04:07:58 wh5-105 kernel: [ 3458.590941]  MISC0 = 0x0
Jul  3 04:07:58 wh5-105 kernel: [ 3458.590942]  MISC1 = 0x0
Jul  3 04:07:58 wh5-105 kernel: [ 3458.590945]  ADDR = 0x6000000000000000
Jul  3 04:07:58 wh5-105 kernel: [ 3458.590951] **************************************
Jul  3 04:07:58 wh5-105 kernel: [ 3458.590968] handle_fhi_ccplex: Scanning CCPLEX Error Records for Correctable Errors
Jul  3 04:07:58 wh5-105 kernel: [ 3458.590973] **************************************
Jul  3 04:07:58 wh5-105 kernel: [ 3458.590975] RAS Error in SCF:L3_3, ERRSELR_EL1=771:
Jul  3 04:07:58 wh5-105 kernel: [ 3458.590976]  Status = 0x45007c0a
Jul  3 04:07:58 wh5-105 kernel: [ 3458.590978]  IERR = L3 Correctable ECC Error: 0x7c
Jul  3 04:07:58 wh5-105 kernel: [ 3458.590979]  SERR = Data value from producer: 0xa
Jul  3 04:07:58 wh5-105 kernel: [ 3458.590980]  Correctable Error
Jul  3 04:07:58 wh5-105 kernel: [ 3458.590985]  MISC0 = 0x1a7aa0000121000
Jul  3 04:07:58 wh5-105 kernel: [ 3458.590986]  MISC1 = 0x0
Jul  3 04:07:58 wh5-105 kernel: [ 3458.590990] **************************************
Jul  3 04:07:58 wh5-105 kernel: [ 3458.591007] handle_fhi_core: Scanning Core Error Records for Correctable Errors
Jul  3 04:07:58 wh5-105 kernel: [ 3458.591096] handle_fhi_corecluster:Scanning CoreCluster Error Records for Correctable Errors
Jul  3 04:07:58 wh5-105 kernel: [ 3458.591140] handle_fhi_ccplex: Scanning CCPLEX Error Records for Correctable Errors
Jul  3 04:07:58 wh5-105 kernel: [ 3458.591161] handle_fhi_core: Scanning Core Error Records for Correctable Errors
Jul  3 04:07:58 wh5-105 kernel: [ 3458.591248] handle_fhi_corecluster:Scanning CoreCluster Error Records for Correctable Errors
Jul  3 04:07:58 wh5-105 kernel: [ 3458.591292] handle_fhi_ccplex: Scanning CCPLEX Error Records for Correctable Errors
Jul  3 04:07:58 wh5-105 kernel: [ 3458.591313] handle_fhi_core: Scanning Core Error Records for Correctable Errors
Jul  3 04:07:58 wh5-105 kernel: [ 3458.591401] handle_fhi_corecluster:Scanning CoreCluster Error Records for Correctable Errors
Jul  3 04:07:58 wh5-105 kernel: [ 3458.591445] handle_fhi_ccplex: Scanning CCPLEX Error Records for Correctable Errors
Jul  3 04:07:58 wh5-105 kernel: [ 3458.591466] handle_fhi_core: Scanning Core Error Records for Correctable Errors
Jul  3 04:07:58 wh5-105 kernel: [ 3458.591551] handle_fhi_corecluster:Scanning CoreCluster Error Records for Correctable Errors
Jul  3 04:07:58 wh5-105 kernel: [ 3458.591594] handle_fhi_ccplex: Scanning CCPLEX Error Records for Correctable Errors
Jul  3 04:07:58 wh5-105 kernel: [ 3458.591613] handle_fhi_core: Scanning Core Error Records for Correctable Errors
Jul  3 04:07:58 wh5-105 kernel: [ 3458.591701] handle_fhi_corecluster:Scanning CoreCluster Error Records for Correctable Errors
Jul  3 04:07:58 wh5-105 kernel: [ 3458.591745] handle_fhi_ccplex: Scanning CCPLEX Error Records for Correctable Errors
Jul  3 04:07:58 wh5-105 kernel: [ 3458.591764] handle_fhi_core: Scanning Core Error Records for Correctable Errors
Jul  3 04:07:58 wh5-105 kernel: [ 3458.591851] handle_fhi_corecluster:Scanning CoreCluster Error Records for Correctable Errors
Jul  3 04:07:58 wh5-105 kernel: [ 3458.591895] handle_fhi_ccplex: Scanning CCPLEX Error Records for Correctable Errors
Jul  3 04:07:58 wh5-105 kernel: [ 3458.610329] handle_fhi_core: Scanning Core Error Records for Correctable Errors
Jul  3 04:07:58 wh5-105 kernel: [ 3458.610421] handle_fhi_corecluster:Scanning CoreCluster Error Records for Correctable Errors
Jul  3 04:07:58 wh5-105 kernel: [ 3458.610466] handle_fhi_ccplex: Scanning CCPLEX Error Records for Correctable Errors
Jul  3 04:10:38 wh5-105 kernel: [ 3619.096154] FAN rising trip_level:3 cur_temp:72000 trip_temps[4]:81000
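
For anyone reading the two Status values in the log above, here is a minimal Python sketch that splits them into fields. It assumes the Status word follows the Arm RAS extension ERR<n>STATUS layout (SERR in bits 7:0, IERR in bits 15:8, CE in bits 25:24, MV bit 26, UE bit 29, V bit 30, AV bit 31); that layout is my assumption and is not stated anywhere in this thread. The function name `decode_ras_status` is made up for illustration.

```python
def decode_ras_status(status: int) -> dict:
    """Split a RAS Status word into fields.

    Bit positions are an assumption taken from the Arm RAS extension
    ERR<n>STATUS register; nothing in this thread confirms them.
    """
    return {
        "serr": status & 0xFF,                    # architecturally defined error code
        "ierr": (status >> 8) & 0xFF,             # implementation defined error code
        "ce": (status >> 24) & 0x3,               # non-zero: corrected error recorded
        "misc_valid": bool(status & (1 << 26)),   # MISC registers hold valid data
        "uncorrected": bool(status & (1 << 29)),  # uncorrected error recorded
        "valid": bool(status & (1 << 30)),        # record is valid
        "addr_valid": bool(status & (1 << 31)),   # ADDR register holds valid data
    }


# The two Status values reported in the log.
for unit, status in (("L2", 0xC5005006), ("SCF:L3_3", 0x45007C0A)):
    fields = decode_ras_status(status)
    print(unit, {k: (hex(v) if isinstance(v, int) and not isinstance(v, bool) else v)
                 for k, v in fields.items()})
```

Under that assumption, the decoded IERR/SERR values match what the kernel printed (0x50/0x6 for L2, 0x7c/0xa for SCF:L3_3), both records have the corrected-error field set and the uncorrected bit clear, and only the L2 record has the address-valid bit set, which is consistent with ADDR being printed only for L2.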

full log:
kern.log (604.6 KB)

We checked the SOM temperature when the problem occurred, and it seems to be within the specified range.

Please analyze the log file, thanks!

How did this issue happen?
What are the steps to reproduce it?
Is this on a devkit or a custom carrier board?

Hi kayccc,

This JetPack version is quite old, and there is no specific method to reproduce the issue.

Maybe try applying these 3 patches to the kernel and see if they bypass the crash.
6b12280.diff.zip (1.3 KB)
af707d0.diff.zip (2.1 KB)
30fadee.diff.zip (2.2 KB)

Hi Wayne,

OK, I will test whether the patch is effective.
Also, what is the cause of this problem?

Hi Wayne,

What is the cause of this problem?
Is it caused by excessively high ambient temperature?

Please try the patches first. If they don't work, then I don't know the cause of your problem either.

This is also an old release, so we may not look into it.

Hi Wayne,

Jetson R32.5.0, kernel 4.9, branch: l4t/l4t-r32.5-4.9

There are 3 patches below, 2 of which failed to be applied:
6b12280.diff.zip (1.3 KB)
af707d0.diff.zip (2.1 KB)
30fadee.diff.zip (2.2 KB)

patch 1 : af707d0.diff

kernel-4.9$ git apply af707d0.diff
error: drivers/platform/tegra/tegra_cbb.c: No such file or directory
error: include/linux/platform/tegra/tegra_cbb.h: No such file or directory

patch 2 : 30fadee.diff
kernel-4.9$ git apply 30fadee.diff
error: drivers/platform/tegra/tegra_cbb.c: No such file or directory

I checked the source code directory and found that the following two files do not exist:
tegra_cbb.c tegra_cbb.h

Steps to patch:

  1. Directory sources/kernel/nvidia$

First apply af707d0.diff, then apply 30fadee.diff.

  2. Directory sources/kernel/kernel-4.9$

Apply 6b12280.diff.
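
The steps above can be sketched as a small Python helper. It builds the `git apply` commands for each directory in the stated order and runs `git apply --check` before each real apply so a mismatch is caught early. The names `PATCH_PLAN`, `build_commands` and `apply_all` are hypothetical, as is the `patch_dir` argument (wherever the unzipped .diff files live; it should be an absolute path because each command runs from a different directory).

```python
import subprocess
from pathlib import Path

# (directory under the sources root, patches in the order to apply them),
# taken from the steps above.
PATCH_PLAN = [
    ("kernel/nvidia", ["af707d0.diff", "30fadee.diff"]),
    ("kernel/kernel-4.9", ["6b12280.diff"]),
]


def build_commands(sources_root, patch_dir):
    """Return (working directory, argv) pairs: a dry-run check, then the apply."""
    commands = []
    for subdir, patches in PATCH_PLAN:
        cwd = Path(sources_root) / subdir
        for name in patches:
            patch = str(Path(patch_dir) / name)
            commands.append((cwd, ["git", "apply", "--check", patch]))
            commands.append((cwd, ["git", "apply", patch]))
    return commands


def apply_all(sources_root, patch_dir):
    """Run every command in order; stop at the first failure."""
    for cwd, argv in build_commands(sources_root, patch_dir):
        subprocess.run(argv, cwd=cwd, check=True)


# Example call (paths are placeholders):
# apply_all("sources", "/path/to/unzipped/patches")
```

Running the check first means a patch that targets files missing from the current directory fails before anything is modified.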
