SError as in https://forums.developer.nvidia.com/t/debugging-serror/120938/6

from devs:

Sometimes I have a problem with CUDA FFT initialization. It occurs in about one of ten software runs; in most cases the initialization runs correctly.
The problem does not occur at all on the desktop, only on the Jetson NX.
The source code looks like this:

    if (g_config.verbose) {
        printf("initialize FFT\n");
    }

    // Create a 1D FFT plan.
    checkCudaErrors(cufftPlanMany(&plan, NRANK, n,
                                  NULL, 1, nx, // *inembed, istride, idist
                                  NULL, 1, nx, // *onembed, ostride, odist
                                  CUFFT_C2C, BATCH));

    if (g_config.verbose) {
        printf("CUDA initialized\n");
    }

And this is what the serial log shows:

initialize FFT
[   52.432201] CPU1: SError detected, daif=1c0, spsr=0x80c000c5, mpidr=80000001, esr=be000000
[   52.432216] CPU4: SError detected, daif=1c0, spsr=0x80c000c5, mpidr=80000200, esr=be000000
[   52.432232] CPU5: SError detected, daif=1c0, spsr=0x40000000, mpidr=80000201, esr=be000000
[   52.432241] CPU2: SError detected, daif=1c0, spsr=0x80c000c5, mpidr=80000100, esr=be000000
[   52.432247] CPU3: SError detected, daif=1c0, spsr=0x80c000c5, mpidr=80000101, esr=be000000
[   52.432255] CPU0: SError detected, daif=1c0, spsr=0x80c000c5, mpidr=80000000, esr=be000000
[   52.432332] ras_ccplex_serr_callback: Scanning CCPLEX Error Records for Uncorrectable Errors
[   52.432362] **************************************
[   52.432368] RAS Error in SCF:SNOC, ERRSELR_EL1=1026:
[   52.432374]  Status = 0xfc00a20d
[   52.432381]  IERR = Uncorrectable Carveout  Error: 0xa2
[   52.432388]  SERR = Illegal address (software fault): 0xd
[   52.432394]  Overflow (there may be more errors) - Uncorrectable
[   52.432398]  Uncorrectable (this is fatal)
[   52.432409]  MISC0 = 0x1804
[   52.432413]  MISC1 = 0x833900000000
[   52.432421]  ADDR = 0x80000000c78e1460
[   52.432433] **************************************
[   52.432445] ras_corecluster_serr_callback:Scanning CoreCluster Error Records for Uncorrectable Errors
[   52.432491] **************************************
[   52.432496] RAS Error in L2, ERRSELR_EL1=544:
[   52.432500]  Status = 0xfc00640d
[   52.432516]  IERR = SCF to L2 Decode Error Read: 0x64
[   52.432540]  SERR = Illegal address (software fault): 0xd
[   52.432542]  Overflow (there may be more errors) - Uncorrectable
[   52.432543]  Uncorrectable (this is fatal)
[   52.432551]  MISC0 = 0x100000000100000
[   52.432553]  MISC1 = 0x40040000000
[   52.432558]  ADDR = 0x80000000c78e1460
[   52.432565] **************************************
[   52.432576] ras_core_serr_callback: Scanning Core Error Records for Uncorrectable Errors
[   52.432677] Bad mode in Error handler detected on CPU4, code 0xbe000000 -- SError
[   52.432681] Kernel panic - not syncing: bad mode
[   52.432698] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G           O    4.9.253-tegra #3
[   52.432700] Hardware name: NVIDIA Jetson Xavier NX Developer Kit (DT)
[   52.432703] Call trace:
[   52.432718] [<ffffff800808ba40>] dump_backtrace+0x0/0x198
[   52.432728] [<ffffff800808c004>] show_stack+0x24/0x30
[   52.432736] [<ffffff8008f62cfc>] dump_stack+0xa0/0xc4
[   52.432743] [<ffffff8008f5fda0>] panic+0x12c/0x2a8
[   52.432748] [<ffffff800808c894>] bad_mode+0x7c/0x80
[   52.432753] [<ffffff800808ca5c>] handle_serr+0x124/0x128
[   52.432757] [<ffffff8008082d98>] el1_serr+0xb0/0x144
[   52.432761] ras_ccplex_serr_callback: Scanning CCPLEX Error Records for Uncorrectable Errors
[   52.432769] [<ffffff8008ba22cc>] cpuidle_enter_state+0x84/0x380
[   52.432775] [<ffffff8008ba263c>] cpuidle_enter+0x34/0x48
[   52.432780] [<ffffff80081113bc>] call_cpuidle+0x44/0x70
[   52.432784] [<ffffff8008111738>] cpu_startup_entry+0x1b0/0x200
[   52.432790] [<ffffff8008091cf8>] secondary_start_kernel+0x190/0x1f8
[   52.432793] ras_corecluster_serr_callback:Scanning CoreCluster Error Records for Uncorrectable Errors
[   52.432797] [<0000000080f701a8>] 0x80f701a8
[   52.432804] SMP: stopping secondary CPUs
[   52.432845] ras_core_serr_callback: Scanning Core Error Records for Uncorrectable Errors
[   52.433037] ras_ccplex_serr_callback: Scanning CCPLEX Error Records for Uncorrectable Errors
[   52.433068] ras_corecluster_serr_callback:Scanning CoreCluster Error Records for Uncorrectable Errors
[   52.433111] ras_core_serr_callback: Scanning Core Error Records for Uncorrectable Errors
[   52.433278] ras_ccplex_serr_callback: Scanning CCPLEX Error Records for Uncorrectable Errors
[   52.433305] ras_corecluster_serr_callback:Scanning CoreCluster Error Records for Uncorrectable Errors
[   52.433348] ras_core_serr_callback: Scanning Core Error Records for Uncorrectable Errors
[   52.433504] ras_ccplex_serr_callback: Scanning CCPLEX Error Records for Uncorrectable Errors
[   52.433534] ras_corecluster_serr_callback:Scanning CoreCluster Error Records for Uncorrectable Errors
[   52.433563] ras_core_serr_callback: Scanning Core Error Records for Uncorrectable Errors
[   52.719414] ras_ccplex_serr_callback: Scanning CCPLEX Error Records for Uncorrectable Errors
[   52.727837] ras_corecluster_serr_callback:Scanning CoreCluster Error Records for Uncorrectable Errors
[   52.737279] ras_core_serr_callback: Scanning Core Error Records for Uncorrectable Errors
[   52.745027] Kernel Offset: disabled
[   52.748903] Memory Limit: none
[   52.751877] trusty-log panic notifier - trusty version Built: 08:57:16 Feb 19 2022
[   52.770426] Rebooting in 5 seconds..
Shutdown state requested 1
Rebooting system ...
[0000.024] W> RATCHET: MB1 binary ratchet value 4 is too large than ratchet level 2 from HW fuses.
[0000.033] I> MB1 (prd-version: 1.5.1.9-t194-41334769-73a9b7ef)
[0000.038] I> Boot-mode: Coldboot
[0000.041] I> Chip revision : A02P
[0000.044] I> Bootrom patch version : 15 (correctly patched)
[0000.049] I> ATE fuse revision : 0x200
[0000.053] I> Ram repair fuse : 0x0
[0000.056] I> Ram Code : 0x0
[0000.058] I> rst_source : 0xb
[0000.061] I> rst_level : 0x1
[0000.065] I> Boot-device: QSPI
[0000.067] I> Qspi flash params source = brbct
[0000.071] I> Qspi using bpmp-dma
[0000.074] I> Qspi clock source : pllp
[0000.078] I> QSPI Flash Size = 32 MB
[0000.081] I> Qspi initialized successfully
[0000.085] W> No valid slot number is found in scratch register
[0000.091] W> Return default slot: _a
[0000.094] I> Active Boot chain : 0
[0000.097] I> Boot-device: QSPI
[0000.100] I> Qspi flash params source = brbct
[0000.106] W> MB1_PLATFORM_CONFIG: device prod data is empty in MB1 BCT.
[0000.112] I> Temperature = 29000
[0000.115] W> Skipping boost for clk: BPMP_CPU_NIC
[0000.119] W> Skipping boost for clk: BPMP_APB
[0000.123] W> Skipping boost for clk: AXI_CBB
[0000.127] W> Skipping boost for clk: AON_CPU_NIC
[0000.132] W> Skipping boost for clk: CAN1
[0000.135] W> Skipping boost for clk: CAN2
[0000.140] I> Boot-device: QSPI
[0000.142] I> Boot-device: QSPI
[0000.145] I> Qspi flash params source = mb1bct
[0000.149] I> Qspi using bpmp-dma
[0000.152] I> Qspi clock source : pllc_out0
[0000.156] I> Qspi reinitialized
[0000.159] I> Qspi flash params source = mb1bct
[0000.164] I> ECC region[0]: Start:0x0, End:0x0
[0000.169] I> ECC region[1]: Start:0x0, End:0x0
[0000.173] I> ECC region[2]: Start:0x0, End:0x0
[0000.177] I> ECC region[3]: Start:0x0, End:0x0
[0000.181] I> ECC region[4]: Start:0x0, End:0x0
[0000.185] I> Non-ECC region[0]: Start:0x80000000, End:0x100000000
[0000.191] I> Non-ECC region[1]: Start:0x0, End:0x0
[0000.195] I> Non-ECC region[2]: Start:0x0, End:0x0
[0000.200] I> Non-ECC region[3]: Start:0x0, End:0x0
[0000.204] I> Non-ECC region[4]: Start:0x0, End:0x0
[0000.210] E> FAILED: Thermal config
[0000.217] E> FAILED: MEMIO rail config
[0000.227] I> Boot-device: QSPI
[0000.230] I> Qspi flash params source = mb1bct
[0000.239] I> Qspi flash params source = mb1bct
[0000.250] I> Qspi flash params source = mb1bct
[0000.317] I> Qspi flash params source = mb1bct
[0000.326] I> Qspi flash params source = mb1bct
[0000.357] I> Qspi flash params source = mb1bct
[0000.369] I> MB1 done

main enter
SPE VERSION #: R01.00.14 Created: Sep 19 2018 @ 11:03:21
HW Function test
Start Scheduler.

Any ideas what causes the errors in 1 of 10 runs?
A similar issue is mentioned here: Debugging SError - #6 by sfalsig

similar issue found here

and here

In our case the board is custom too, with a production module.

  1. Please provide the JetPack version you are using.

  2. Are you using the NV devkit to reproduce the issue, or is it a custom board?

  3. How did you reproduce this issue? Is this CUDA FFT code the sample code, or must we include your changes to reproduce it?

The issue is observed on a production NX module with a custom carrier board.
As far as I can tell, the latest version of JetPack is used. From the error log we can see:

 [   52.432698] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G           O    4.9.253-tegra #3

Once there is more information from the devs, I will update you.

If possible, please use the NV devkit to reproduce this issue. Or at least share with us an application that can reproduce it.

@WayneWWW
Thank you for following up.
The devs said they will try on the NX devkit too.
Regarding the version used:

dpkg-query --show nvidia-l4t-core
nvidia-l4t-core	32.7.1-20220219090344

-- from devs
They will prepare an app so you can test.

The error appears to come from an access to a carveout address, 0xc78e1460.
Can you add prints to check the addresses being accessed by pointers?
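
A minimal sketch of what such prints could look like in C, assuming the buffers are ordinary allocations visible to the process. The helper name format_range is hypothetical and not part of CUDA or cuFFT. Keep in mind that these are virtual addresses as seen by the process, while the ADDR field in the RAS log is presumably a physical address, so the two values cannot be compared directly.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: write "name: [start, end) N bytes" into out.
 * Returns the number of characters snprintf would have written. */
static int format_range(char *out, size_t out_len, const char *name,
                        const void *ptr, size_t len)
{
    uintptr_t start = (uintptr_t)ptr;
    uintptr_t end = start + len; /* one past the last byte */

    return snprintf(out, out_len, "%s: [0x%llx, 0x%llx) %zu bytes", name,
                    (unsigned long long)start, (unsigned long long)end, len);
}
```

Formatting each buffer this way just before the cufftPlanMany call and printing the resulting string would show which ranges the application itself believes it owns.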

@sumitg Thank you for following up!
From the devs:

It seems that the same code without the FPGA interface runs correctly.
The software is identical; the only difference is that the data are not sent from the FPGA over the DMA interface but are read from a binary file.
I ran the software that reads the data from a binary file more than 650 times, and it always initialized correctly.
So it seems that CUDA and the Jetson are OK.
"I am checking the FPGA firmware now.
The FPGA can access the whole Jetson memory, so maybe we send some data to the wrong memory region."
--- from devs

@sumitg @WayneWWW
Thanks, guys.
The devs seem to have resolved the reset issue:

The FPGA sent some part of data outside the allocated memory buffer (CMA buffer).
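
A minimal sketch of the kind of bounds check that would catch this before a transfer is programmed, assuming the driver knows the base address and size of the CMA buffer. The function name transfer_fits and its parameters are hypothetical; the actual FPGA driver interface is not shown in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check: does a transfer of len bytes starting at dst stay
 * inside the buffer [buf_base, buf_base + buf_size)?
 * Written with subtractions only, so it cannot overflow. */
static bool transfer_fits(uint64_t buf_base, uint64_t buf_size,
                          uint64_t dst, uint64_t len)
{
    if (dst < buf_base)
        return false;               /* starts before the buffer */

    uint64_t offset = dst - buf_base;
    if (offset > buf_size)
        return false;               /* starts past the end of the buffer */

    return len <= buf_size - offset; /* the tail must fit as well */
}
```

With a buffer at 0x840000000 of size 0x40000000, a 0x20-byte transfer starting 0x10 bytes before the end is rejected, as is any transfer aimed at 0xc78e1460.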

But they raised another concern. Could you expand on it?

How do you compile a kernel with a large CMA buffer on the Jetson NX?
Is 1 GB possible?
The dev says they can't get more than 384 MB.

I was able to create one by adding a device tree node.
Please refer to this link for more info: reserved-memory.txt - Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt - Linux source code (v4.9.309) - Bootlin

DT entry:
linux,cma {
        compatible = "shared-dma-pool";
        reusable;
        size = <0x0 0x40000000>;
        alignment = <0x0 0x10000>;
        status = "okay";
};

Kernel Logs:
root@tegra-ubuntu:/home/ubuntu# dmesg | grep -i "CMA memory pool"
[ 0.000000] Reserved memory: created CMA memory pool at 0x0000000840000000, size 1024 MiB
root@tegra-ubuntu:/home/ubuntu# cat /proc/meminfo | grep -i cma
CmaTotal: 1802240 kB
CmaFree: 1735448 kB
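
If the application should verify the CMA size at startup instead of relying on a manual grep, a small sketch like the following could parse the same /proc/meminfo lines. The helper name parse_meminfo_kb is hypothetical, and the sketch assumes the "Key: value kB" line format shown above.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: if line starts with "<key>:", store the number that
 * follows (in kB) in *kb_out and return 1. Otherwise return 0. */
static int parse_meminfo_kb(const char *line, const char *key,
                            unsigned long *kb_out)
{
    size_t key_len = strlen(key);

    if (strncmp(line, key, key_len) != 0 || line[key_len] != ':')
        return 0;

    /* %lu skips the whitespace between the colon and the number. */
    return sscanf(line + key_len + 1, "%lu", kb_out) == 1;
}
```

Each line read from /proc/meminfo with fgets can be passed to this function with the key "CmaTotal" or "CmaFree".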

@sumitg
Thank you very much! It should work.
I guess the next question from the devs will be how to reproduce your implementation by adding a device tree node.
Does it require creating a file, pasting the DT entry above into it, and then somehow building it on the running OS? Could you expand on the steps for adding the device tree node locally on our side, please?
AV

I added the node in the file tegra194-soc-base.dtsi as shown below.

    reserved-memory {
            #address-cells = <2>;
            #size-cells = <2>;
            ranges;

            linux,cma {
                    compatible = "shared-dma-pool";
                    reusable;
                    size = <0x0 0x40000000>;
                    alignment = <0x0 0x10000>;
                    status = "okay";
            };
    };
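
For reading the values in this node: with #size-cells = <2>, each size or alignment value is written as two 32-bit cells, high word first. The sketch below (the function name dt_cells_to_u64 is hypothetical) shows the arithmetic, so size = <0x0 0x40000000> is 0x40000000 bytes, i.e. 1024 MiB, and alignment = <0x0 0x10000> is 64 KiB.

```c
#include <stdint.h>

/* Hypothetical helper: combine two 32-bit device tree cells
 * (high word first) into a single 64-bit value. */
static uint64_t dt_cells_to_u64(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}
```

To request a different pool size, only the two cells of the size property need to change; for instance, 512 MiB would be <0x0 0x20000000>.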
