Hi NV team,
On JetPack 4.5 (JP4.5), a Xavier device encountered CPU-related errors:
Jul 3 04:07:58 wh5-105 kernel: [ 3458.590804] handle_fhi_core: Scanning Core Error Records for Correctable Errors
Jul 3 04:07:58 wh5-105 kernel: [ 3458.590809] CPU4: RAS: FHI 479 detected
Jul 3 04:07:58 wh5-105 kernel: [ 3458.590811] CPU5: RAS: FHI 480 detected
Jul 3 04:07:58 wh5-105 kernel: [ 3458.590820] CPU2: RAS: FHI 477 detected
Jul 3 04:07:58 wh5-105 kernel: [ 3458.590823] CPU3: RAS: FHI 478 detected
Jul 3 04:07:58 wh5-105 kernel: [ 3458.590896] handle_fhi_corecluster:Scanning CoreCluster Error Records for Correctable Errors
Jul 3 04:07:58 wh5-105 kernel: [ 3458.590924] **************************************
Jul 3 04:07:58 wh5-105 kernel: [ 3458.590925] RAS Error in L2, ERRSELR_EL1=544:
Jul 3 04:07:58 wh5-105 kernel: [ 3458.590931] Status = 0xc5005006
Jul 3 04:07:58 wh5-105 kernel: [ 3458.590933] IERR = L2 MLC Correctable Error: 0x50
Jul 3 04:07:58 wh5-105 kernel: [ 3458.590935] SERR = Data value from associative memory: 0x6
Jul 3 04:07:58 wh5-105 kernel: [ 3458.590936] Correctable Error
Jul 3 04:07:58 wh5-105 kernel: [ 3458.590941] MISC0 = 0x0
Jul 3 04:07:58 wh5-105 kernel: [ 3458.590942] MISC1 = 0x0
Jul 3 04:07:58 wh5-105 kernel: [ 3458.590945] ADDR = 0x6000000000000000
Jul 3 04:07:58 wh5-105 kernel: [ 3458.590951] **************************************
Jul 3 04:07:58 wh5-105 kernel: [ 3458.590968] handle_fhi_ccplex: Scanning CCPLEX Error Records for Correctable Errors
Jul 3 04:07:58 wh5-105 kernel: [ 3458.590973] **************************************
Jul 3 04:07:58 wh5-105 kernel: [ 3458.590975] RAS Error in SCF:L3_3, ERRSELR_EL1=771:
Jul 3 04:07:58 wh5-105 kernel: [ 3458.590976] Status = 0x45007c0a
Jul 3 04:07:58 wh5-105 kernel: [ 3458.590978] IERR = L3 Correctable ECC Error: 0x7c
Jul 3 04:07:58 wh5-105 kernel: [ 3458.590979] SERR = Data value from producer: 0xa
Jul 3 04:07:58 wh5-105 kernel: [ 3458.590980] Correctable Error
Jul 3 04:07:58 wh5-105 kernel: [ 3458.590985] MISC0 = 0x1a7aa0000121000
Jul 3 04:07:58 wh5-105 kernel: [ 3458.590986] MISC1 = 0x0
Jul 3 04:07:58 wh5-105 kernel: [ 3458.590990] **************************************
Jul 3 04:07:58 wh5-105 kernel: [ 3458.591007] handle_fhi_core: Scanning Core Error Records for Correctable Errors
Jul 3 04:07:58 wh5-105 kernel: [ 3458.591096] handle_fhi_corecluster:Scanning CoreCluster Error Records for Correctable Errors
Jul 3 04:07:58 wh5-105 kernel: [ 3458.591140] handle_fhi_ccplex: Scanning CCPLEX Error Records for Correctable Errors
Jul 3 04:07:58 wh5-105 kernel: [ 3458.591161] handle_fhi_core: Scanning Core Error Records for Correctable Errors
Jul 3 04:07:58 wh5-105 kernel: [ 3458.591248] handle_fhi_corecluster:Scanning CoreCluster Error Records for Correctable Errors
Jul 3 04:07:58 wh5-105 kernel: [ 3458.591292] handle_fhi_ccplex: Scanning CCPLEX Error Records for Correctable Errors
Jul 3 04:07:58 wh5-105 kernel: [ 3458.591313] handle_fhi_core: Scanning Core Error Records for Correctable Errors
Jul 3 04:07:58 wh5-105 kernel: [ 3458.591401] handle_fhi_corecluster:Scanning CoreCluster Error Records for Correctable Errors
Jul 3 04:07:58 wh5-105 kernel: [ 3458.591445] handle_fhi_ccplex: Scanning CCPLEX Error Records for Correctable Errors
Jul 3 04:07:58 wh5-105 kernel: [ 3458.591466] handle_fhi_core: Scanning Core Error Records for Correctable Errors
Jul 3 04:07:58 wh5-105 kernel: [ 3458.591551] handle_fhi_corecluster:Scanning CoreCluster Error Records for Correctable Errors
Jul 3 04:07:58 wh5-105 kernel: [ 3458.591594] handle_fhi_ccplex: Scanning CCPLEX Error Records for Correctable Errors
Jul 3 04:07:58 wh5-105 kernel: [ 3458.591613] handle_fhi_core: Scanning Core Error Records for Correctable Errors
Jul 3 04:07:58 wh5-105 kernel: [ 3458.591701] handle_fhi_corecluster:Scanning CoreCluster Error Records for Correctable Errors
Jul 3 04:07:58 wh5-105 kernel: [ 3458.591745] handle_fhi_ccplex: Scanning CCPLEX Error Records for Correctable Errors
Jul 3 04:07:58 wh5-105 kernel: [ 3458.591764] handle_fhi_core: Scanning Core Error Records for Correctable Errors
Jul 3 04:07:58 wh5-105 kernel: [ 3458.591851] handle_fhi_corecluster:Scanning CoreCluster Error Records for Correctable Errors
Jul 3 04:07:58 wh5-105 kernel: [ 3458.591895] handle_fhi_ccplex: Scanning CCPLEX Error Records for Correctable Errors
Jul 3 04:07:58 wh5-105 kernel: [ 3458.610329] handle_fhi_core: Scanning Core Error Records for Correctable Errors
Jul 3 04:07:58 wh5-105 kernel: [ 3458.610421] handle_fhi_corecluster:Scanning CoreCluster Error Records for Correctable Errors
Jul 3 04:07:58 wh5-105 kernel: [ 3458.610466] handle_fhi_ccplex: Scanning CCPLEX Error Records for Correctable Errors
Jul 3 04:10:38 wh5-105 kernel: [ 3619.096154] FAN rising trip_level:3 cur_temp:72000 trip_temps[4]:81000
Full log:
kern.log (604.6 KB)
We checked the SOM temperature when the problem occurred, and it seems to be within the specified range.
Please analyze the log file. Thanks!
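
For reference, here is a rough sketch of how the two Status values in the log above break down into fields. It assumes the generic Arm RAS extension layout for ERR<n>STATUS (AV bit 31, V bit 30, UE bit 29, MV bit 26, CE bits 25:24, IERR bits 15:8, SERR bits 7:0); the Xavier (Carmel) specific layout may differ, although the IERR and SERR values it extracts agree with what the kernel printed. The function name decode_ras_status is made up for this example.

```python
def decode_ras_status(status: int) -> dict:
    """Split a RAS Status value into fields.

    Bit positions follow the generic Arm RAS ERR<n>STATUS layout,
    which is an assumption for this platform.
    """
    return {
        "AV": (status >> 31) & 0x1,    # address register valid
        "V": (status >> 30) & 0x1,     # error record valid
        "UE": (status >> 29) & 0x1,    # uncorrected error
        "MV": (status >> 26) & 0x1,    # MISC registers valid
        "CE": (status >> 24) & 0x3,    # corrected error indicator
        "IERR": (status >> 8) & 0xFF,  # implementation-defined error code
        "SERR": status & 0xFF,         # architecturally defined error code
    }


# Status values copied from the log excerpt above.
for unit, status in (("L2", 0xC5005006), ("SCF:L3_3", 0x45007C0A)):
    fields = decode_ras_status(status)
    summary = ", ".join(f"{name}={value:#x}" for name, value in fields.items())
    print(f"{unit}: {summary}")
```

Under that assumed layout, both records have UE=0 and a nonzero CE field, which matches the "Correctable Error" lines in the log, and only the L2 record has AV set, which matches ADDR being printed only for L2.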