Debugging SError

I’m trying to debug a set of SErrors that show up in the kernel log, such as:

[ 3641.615008] CPU0: SError detected, daif=1c0, spsr=0x200000c5, mpidr=80000100, esr=bf000002
[ 3641.615010] CPU3: SError detected, daif=1c0, spsr=0x800000c5, mpidr=80000101, esr=bf40c000
[ 3641.615012] CPU5: SError detected, daif=1c0, spsr=0x200000c5, mpidr=80000103, esr=bf40c000
[ 3641.615032] CPU2: SError detected, daif=1c0, spsr=0x800000c5, mpidr=80000001, esr=be000000
[ 3641.615039] CPU1: SError detected, daif=140, spsr=0x20000000, mpidr=80000000, esr=be000000
[ 3641.615113] CPU4: SError detected, daif=140, spsr=0x20000000, mpidr=80000102, esr=bf40c000
[ 3641.615141] CPU0: SError detected, daif=1c0, spsr=0x200000c5, mpidr=80000100, esr=bf40c000

For now I’m mainly trying to figure out what the esr register values mean. I’ve found a description in the ARM Architecture Reference Manual, but as far as I can see, all the esr values from the A57 cores have the “Implementation Defined” bit set (bit [25]), and the ARM manual doesn’t provide any further information on that.

I’ve had a look through the TX2/Parker TRM also, but can’t find anything related to the contents of that register.

Can anyone point me in the right direction?

Hi,

I would suggest checking the address that caused the error. Most often the cause is an access to an unprivileged or unmapped address, or to registers that are unpowered or unclocked.
Please attach the complete log.

thanks
Bibek

  • ESR bit [25] is IL (Instruction Length for synchronous exceptions), not “Implementation Defined”.
  • In the “ISS encoding for an SError interrupt”, bit [24] is IDS, which indicates an IMPLEMENTATION DEFINED syndrome.
  • You will get more debug info from implementation-defined registers in the kernel log after the SError prints.
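
For reference, the field layout described in the bullets above can be sketched as a small decoder. This is only an illustration, not an NVIDIA or kernel tool: the function name decode_serror_esr and the dictionary keys are made up for this example, and it only splits out the architecturally defined fields (EC in bits [31:26], IL in bit [25], IDS in bit [24]). When IDS is 1, the remaining bits [23:0] are implementation defined and are just reported raw.

```python
def decode_serror_esr(esr: int) -> dict:
    """Split an AArch64 ESR value into EC, IL, IDS and the remaining ISS bits.

    Hypothetical helper for illustration; it does not interpret the
    implementation-defined part of the syndrome.
    """
    ec = (esr >> 26) & 0x3F      # Exception Class, bits [31:26]
    il = (esr >> 25) & 0x1       # Instruction Length, bit [25]
    iss = esr & 0x1FFFFFF        # Instruction Specific Syndrome, bits [24:0]
    ids = (iss >> 24) & 0x1      # IDS, bit [24] of the SError ISS encoding
    return {
        "ec": ec,
        "is_serror": ec == 0x2F,   # EC 0x2F is "SError interrupt"
        "il": il,
        "ids": ids,
        "iss_low": iss & 0xFFFFFF,  # bits [23:0]; implementation defined when ids == 1
    }


if __name__ == "__main__":
    # The three distinct esr values from the log at the top of the thread.
    for value in (0xBF000002, 0xBF40C000, 0xBE000000):
        fields = decode_serror_esr(value)
        print(f"esr={value:08x} ec={fields['ec']:#x} il={fields['il']} "
              f"ids={fields['ids']} iss_low={fields['iss_low']:#08x}")
```

Run against the log values, this shows IDS set for bf000002 and bf40c000 and clear for be000000, with IL set in all three.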

Thanks! Where do I find the address? I’ve attached a full dmesg log (dmesg.log (65.6 KB)). I do get a BUG and some backtraces, but to me those look like a side effect: the SError handler calls functions that are not allowed in the context those specific cores happened to be in.
Also, for reference, the exact same code runs without these issues on a lot of other boards, so I suspect this is a hardware issue. Any hint on which part of the TX2 is causing these errors would be very useful.
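
To check which context each core was in when the SError was reported, the spsr and mpidr fields of the log lines can be decoded as well. The following is a rough sketch under a few assumptions: the log lines have exactly the format shown at the top of the thread, the helper name parse_serror_line and its output keys are invented for this example, and only the architectural fields are used (SPSR M[3:0] for the exception level, SPSR bit [7] for the IRQ mask, MPIDR Aff1/Aff0). Which Aff1 value corresponds to which CPU cluster on a given board is not decoded here.

```python
import re

# Matches the "SError detected" lines as printed in the log above.
LINE_RE = re.compile(
    r"CPU(?P<cpu>\d+): SError detected, daif=(?P<daif>[0-9a-f]+), "
    r"spsr=0x(?P<spsr>[0-9a-f]+), mpidr=(?P<mpidr>[0-9a-f]+), "
    r"esr=(?P<esr>[0-9a-f]+)"
)

# AArch64 SPSR M[3:0] encodings for the exception level and stack pointer.
EL_MODES = {0b0000: "EL0t", 0b0100: "EL1t", 0b0101: "EL1h",
            0b1000: "EL2t", 0b1001: "EL2h"}


def parse_serror_line(line: str):
    """Return decoded fields for one log line, or None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    spsr = int(m["spsr"], 16)
    mpidr = int(m["mpidr"], 16)
    return {
        "cpu": int(m["cpu"]),
        "mode": EL_MODES.get(spsr & 0xF, "unknown"),  # where the core was running
        "irq_masked": bool(spsr & (1 << 7)),          # SPSR.I
        "aff1": (mpidr >> 8) & 0xFF,                  # MPIDR Aff1 (cluster level)
        "aff0": mpidr & 0xFF,                         # MPIDR Aff0 (core within cluster)
        "esr": int(m["esr"], 16),
    }


if __name__ == "__main__":
    sample = ("[ 3641.615039] CPU1: SError detected, daif=140, "
              "spsr=0x20000000, mpidr=80000000, esr=be000000")
    print(parse_serror_line(sample))
```

For the lines in the log above, the cores reporting spsr=0x...c5 were in EL1h with IRQs masked, while the two reporting spsr=0x20000000 were in EL0t.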

Ah yes, of course - I meant bit [24].

Not sure I’m seeing that though? See comment above.

Please make the change below and check whether you get more info about the SError.
Also, it seems you are using “PREEMPT RT” on kernel 4.4. We no longer support RT on 4.4. It would be better to switch to release-32, which has kernel 4.9 and also supports RT.

File: arch/arm64/kernel/traps.c
Change:
@@ -518,7 +518,7 @@ asmlinkage void __exception handle_serr(unsigned long daif, unsigned long spsr,

     pr_crit("CPU%d: SError detected, daif=%lx, "
             "spsr=0x%lx, mpidr=%lx, esr=%lx\n",
-            smp_processor_id(), daif, spsr, mpidr, esr);
+            raw_smp_processor_id(), daif, spsr, mpidr, esr);

@sumitg
Could you also help with resolving a similar SError issue, described in https://forums.developer.nvidia.com/t/debugging-serror/120938/6 (#2 by _av)?