from devs:
Sometimes I have a problem with CUDA FFT initialization. It occurs in about one of every ten runs of the software; in most cases, the initialization completes correctly.
The problem never occurs on the desktop, only on the Jetson Xavier NX.
The source code looks like this:
    if (g_config.verbose) {
        printf("initialize FFT\n");
    }
    // Create a 1D FFT plan.
    checkCudaErrors(cufftPlanMany(&plan, NRANK, n,
                                  NULL, 1, nx,   // *inembed, istride, idist
                                  NULL, 1, nx,   // *onembed, ostride, odist
                                  CUFFT_C2C, BATCH));
    if (g_config.verbose) {
        printf("CUDA initialized\n");
    }
When it fails, the serial log shows this:
initialize FFT
[ 52.432201] CPU1: SError detected, daif=1c0, spsr=0x80c000c5, mpidr=80000001, esr=be000000
[ 52.432216] CPU4: SError detected, daif=1c0, spsr=0x80c000c5, mpidr=80000200, esr=be000000
[ 52.432232] CPU5: SError detected, daif=1c0, spsr=0x40000000, mpidr=80000201, esr=be000000
[ 52.432241] CPU2: SError detected, daif=1c0, spsr=0x80c000c5, mpidr=80000100, esr=be000000
[ 52.432247] CPU3: SError detected, daif=1c0, spsr=0x80c000c5, mpidr=80000101, esr=be000000
[ 52.432255] CPU0: SError detected, daif=1c0, spsr=0x80c000c5, mpidr=80000000, esr=be000000
[ 52.432332] ras_ccplex_serr_callback: Scanning CCPLEX Error Records for Uncorrectable Errors
[ 52.432362] **************************************
[ 52.432368] RAS Error in SCF:SNOC, ERRSELR_EL1=1026:
[ 52.432374] Status = 0xfc00a20d
[ 52.432381] IERR = Uncorrectable Carveout Error: 0xa2
[ 52.432388] SERR = Illegal address (software fault): 0xd
[ 52.432394] Overflow (there may be more errors) - Uncorrectable
[ 52.432398] Uncorrectable (this is fatal)
[ 52.432409] MISC0 = 0x1804
[ 52.432413] MISC1 = 0x833900000000
[ 52.432421] ADDR = 0x80000000c78e1460
[ 52.432433] **************************************
[ 52.432445] ras_corecluster_serr_callback:Scanning CoreCluster Error Records for Uncorrectable Errors
[ 52.432491] **************************************
[ 52.432496] RAS Error in L2, ERRSELR_EL1=544:
[ 52.432500] Status = 0xfc00640d
[ 52.432516] IERR = SCF to L2 Decode Error Read: 0x64
[ 52.432540] SERR = Illegal address (software fault): 0xd
[ 52.432542] Overflow (there may be more errors) - Uncorrectable
[ 52.432543] Uncorrectable (this is fatal)
[ 52.432551] MISC0 = 0x100000000100000
[ 52.432553] MISC1 = 0x40040000000
[ 52.432558] ADDR = 0x80000000c78e1460
[ 52.432565] **************************************
[ 52.432576] ras_core_serr_callback: Scanning Core Error Records for Uncorrectable Errors
[ 52.432677] Bad mode in Error handler detected on CPU4, code 0xbe000000 -- SError
[ 52.432681] Kernel panic - not syncing: bad mode
[ 52.432698] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G O 4.9.253-tegra #3
[ 52.432700] Hardware name: NVIDIA Jetson Xavier NX Developer Kit (DT)
[ 52.432703] Call trace:
[ 52.432718] [<ffffff800808ba40>] dump_backtrace+0x0/0x198
[ 52.432728] [<ffffff800808c004>] show_stack+0x24/0x30
[ 52.432736] [<ffffff8008f62cfc>] dump_stack+0xa0/0xc4
[ 52.432743] [<ffffff8008f5fda0>] panic+0x12c/0x2a8
[ 52.432748] [<ffffff800808c894>] bad_mode+0x7c/0x80
[ 52.432753] [<ffffff800808ca5c>] handle_serr+0x124/0x128
[ 52.432757] [<ffffff8008082d98>] el1_serr+0xb0/0x144
[ 52.432761] ras_ccplex_serr_callback: Scanning CCPLEX Error Records for Uncorrectable Errors
[ 52.432769] [<ffffff8008ba22cc>] cpuidle_enter_state+0x84/0x380
[ 52.432775] [<ffffff8008ba263c>] cpuidle_enter+0x34/0x48
[ 52.432780] [<ffffff80081113bc>] call_cpuidle+0x44/0x70
[ 52.432784] [<ffffff8008111738>] cpu_startup_entry+0x1b0/0x200
[ 52.432790] [<ffffff8008091cf8>] secondary_start_kernel+0x190/0x1f8
[ 52.432793] ras_corecluster_serr_callback:Scanning CoreCluster Error Records for Uncorrectable Errors
[ 52.432797] [<0000000080f701a8>] 0x80f701a8
[ 52.432804] SMP: stopping secondary CPUs
[ 52.432845] ras_core_serr_callback: Scanning Core Error Records for Uncorrectable Errors
[ 52.433037] ras_ccplex_serr_callback: Scanning CCPLEX Error Records for Uncorrectable Errors
[ 52.433068] ras_corecluster_serr_callback:Scanning CoreCluster Error Records for Uncorrectable Errors
[ 52.433111] ras_core_serr_callback: Scanning Core Error Records for Uncorrectable Errors
[ 52.433278] ras_ccplex_serr_callback: Scanning CCPLEX Error Records for Uncorrectable Errors
[ 52.433305] ras_corecluster_serr_callback:Scanning CoreCluster Error Records for Uncorrectable Errors
[ 52.433348] ras_core_serr_callback: Scanning Core Error Records for Uncorrectable Errors
[ 52.433504] ras_ccplex_serr_callback: Scanning CCPLEX Error Records for Uncorrectable Errors
[ 52.433534] ras_corecluster_serr_callback:Scanning CoreCluster Error Records for Uncorrectable Errors
[ 52.433563] ras_core_serr_callback: Scanning Core Error Records for Uncorrectable Errors
[ 52.719414] ras_ccplex_serr_callback: Scanning CCPLEX Error Records for Uncorrectable Errors
[ 52.727837] ras_corecluster_serr_callback:Scanning CoreCluster Error Records for Uncorrectable Errors
[ 52.737279] ras_core_serr_callback: Scanning Core Error Records for Uncorrectable Errors
[ 52.745027] Kernel Offset: disabled
[ 52.748903] Memory Limit: none
[ 52.751877] trusty-log panic notifier - trusty version Built: 08:57:16 Feb 19 2022
[ 52.770426] Rebooting in 5 seconds..
Shutdown state requested 1
Rebooting system ...
[0000.024] W> RATCHET: MB1 binary ratchet value 4 is too large than ratchet level 2 from HW fuses.
[0000.033] I> MB1 (prd-version: 1.5.1.9-t194-41334769-73a9b7ef)
[0000.038] I> Boot-mode: Coldboot
[0000.041] I> Chip revision : A02P
[0000.044] I> Bootrom patch version : 15 (correctly patched)
[0000.049] I> ATE fuse revision : 0x200
[0000.053] I> Ram repair fuse : 0x0
[0000.056] I> Ram Code : 0x0
[0000.058] I> rst_source : 0xb
[0000.061] I> rst_level : 0x1
[0000.065] I> Boot-device: QSPI
[0000.067] I> Qspi flash params source = brbct
[0000.071] I> Qspi using bpmp-dma
[0000.074] I> Qspi clock source : pllp
[0000.078] I> QSPI Flash Size = 32 MB
[0000.081] I> Qspi initialized successfully
[0000.085] W> No valid slot number is found in scratch register
[0000.091] W> Return default slot: _a
[0000.094] I> Active Boot chain : 0
[0000.097] I> Boot-device: QSPI
[0000.100] I> Qspi flash params source = brbct
[0000.106] W> MB1_PLATFORM_CONFIG: device prod data is empty in MB1 BCT.
[0000.112] I> Temperature = 29000
[0000.115] W> Skipping boost for clk: BPMP_CPU_NIC
[0000.119] W> Skipping boost for clk: BPMP_APB
[0000.123] W> Skipping boost for clk: AXI_CBB
[0000.127] W> Skipping boost for clk: AON_CPU_NIC
[0000.132] W> Skipping boost for clk: CAN1
[0000.135] W> Skipping boost for clk: CAN2
[0000.140] I> Boot-device: QSPI
[0000.142] I> Boot-device: QSPI
[0000.145] I> Qspi flash params source = mb1bct
[0000.149] I> Qspi using bpmp-dma
[0000.152] I> Qspi clock source : pllc_out0
[0000.156] I> Qspi reinitialized
[0000.159] I> Qspi flash params source = mb1bct
[0000.164] I> ECC region[0]: Start:0x0, End:0x0
[0000.169] I> ECC region[1]: Start:0x0, End:0x0
[0000.173] I> ECC region[2]: Start:0x0, End:0x0
[0000.177] I> ECC region[3]: Start:0x0, End:0x0
[0000.181] I> ECC region[4]: Start:0x0, End:0x0
[0000.185] I> Non-ECC region[0]: Start:0x80000000, End:0x100000000
[0000.191] I> Non-ECC region[1]: Start:0x0, End:0x0
[0000.195] I> Non-ECC region[2]: Start:0x0, End:0x0
[0000.200] I> Non-ECC region[3]: Start:0x0, End:0x0
[0000.204] I> Non-ECC region[4]: Start:0x0, End:0x0
[0000.210] E> FAILED: Thermal config
[0000.217] E> FAILED: MEMIO rail config
[0000.227] I> Boot-device: QSPI
[0000.230] I> Qspi flash params source = mb1bct
[0000.239] I> Qspi flash params source = mb1bct
[0000.250] I> Qspi flash params source = mb1bct
[0000.317] I> Qspi flash params source = mb1bct
[0000.326] I> Qspi flash params source = mb1bct
[0000.357] I> Qspi flash params source = mb1bct
[0000.369] I> MB1 done
main enter
SPE VERSION #: R01.00.14 Created: Sep 19 2018 @ 11:03:21
HW Function test
Start Scheduler.
Any ideas what causes these errors in about 1 of 10 runs?
A similar issue is mentioned here: Debugging SError - #6 by sfalsig
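
For anyone reading the RAS records in the log by hand, here is a minimal sketch of how the raw values can be split into fields. It assumes that the Status values follow the Arm RAS extension ERR<n>STATUS layout (AV, V, UE, ER, OF, MV in bits 31 to 26, IERR in bits 15:8, SERR in bits 7:0) and that esr is an ESR_ELx value with the exception class in bits 31:26. The names ras_status, decode_ras_status, esr_exception_class and print_ras_status are made up for illustration and are not part of any NVIDIA or kernel API.

```c
#include <stdint.h>
#include <stdio.h>

/* Decoded fields of a RAS status value.
 * Assumption: layout of the Arm RAS extension ERR<n>STATUS register. */
typedef struct {
    unsigned av;   /* bit 31: ADDR field is valid */
    unsigned v;    /* bit 30: this status record is valid */
    unsigned ue;   /* bit 29: uncorrected error */
    unsigned er;   /* bit 28: error reported */
    unsigned of;   /* bit 27: overflow, more errors may have occurred */
    unsigned mv;   /* bit 26: MISC fields are valid */
    unsigned ierr; /* bits 15:8: implementation-defined error code */
    unsigned serr; /* bits 7:0: architecturally defined error code */
} ras_status;

/* Split a raw Status value (hypothetical helper). */
ras_status decode_ras_status(uint64_t status)
{
    ras_status s;
    s.av   = (unsigned)((status >> 31) & 1u);
    s.v    = (unsigned)((status >> 30) & 1u);
    s.ue   = (unsigned)((status >> 29) & 1u);
    s.er   = (unsigned)((status >> 28) & 1u);
    s.of   = (unsigned)((status >> 27) & 1u);
    s.mv   = (unsigned)((status >> 26) & 1u);
    s.ierr = (unsigned)((status >> 8) & 0xffu);
    s.serr = (unsigned)(status & 0xffu);
    return s;
}

/* Exception class (EC) of an ESR_ELx value: bits 31:26. */
unsigned esr_exception_class(uint32_t esr)
{
    return (esr >> 26) & 0x3fu;
}

/* Print the decoded fields in a compact form (hypothetical helper). */
void print_ras_status(uint64_t status)
{
    ras_status s = decode_ras_status(status);
    printf("AV=%u V=%u UE=%u ER=%u OF=%u MV=%u IERR=0x%x SERR=0x%x\n",
           s.av, s.v, s.ue, s.er, s.of, s.mv, s.ierr, s.serr);
}
```

Applied to the two Status values above (0xfc00a20d and 0xfc00640d), this yields IERR 0xa2 and 0x64 with SERR 0xd in both cases, which agrees with the IERR and SERR lines the kernel printed. The esr value be000000 has exception class 0x2f, the code for an SError interrupt. This only restates what the log already reports and does not by itself explain the cause.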