Using kexec to boot into another OS runs into SError

I am using kexec from the OS on the eMMC to boot into an OS on another drive. The second OS boots mostly fine until it hits the following panic:

[   15.568185] CPU3: SError detected, daif=140, spsr=0x60000000, mpidr=80000101, esr=be000000
[   15.568196] CPU0: SError detected, daif=140, spsr=0x0, mpidr=80000000, esr=be000000
[   15.568201] CPU1: SError detected, daif=140, spsr=0x0, mpidr=80000001, esr=be000000
[   15.568207] CPU2: SError detected, daif=140, spsr=0xc00045, mpidr=80000100, esr=be000000
[   15.568221] CPU5: SError detected, daif=1c0, spsr=0x80c000c5, mpidr=80000201, esr=be000000
[   15.568227] CPU4: SError detected, daif=140, spsr=0x80000000, mpidr=80000200, esr=be000000
[   15.568298] ras_ccplex_serr_callback: Scanning CCPLEX Error Records for Uncorrectable Errors
[   15.568319] **************************************
[   15.568321] RAS Error in SCF:SNOC, ERRSELR_EL1=1026:
[   15.568323]  Status = 0xf400a20d
[   15.568326]  IERR = Uncorrectable Carveout  Error: 0xa2
[   15.568328]  SERR = Illegal address (software fault): 0xd
[   15.568329]  Uncorrectable (this is fatal)
[   15.568336]  MISC0 = 0x804
[   15.568338]  MISC1 = 0xa10900000000
[   15.568343]  ADDR = 0x80000000c6000000
[   15.568349] **************************************
[   15.568355] ras_corecluster_serr_callback:Scanning CoreCluster Error Records for Uncorrectable Errors
[   15.568396] ras_core_serr_callback: Scanning Core Error Records for Uncorrectable Errors
[   15.568495] Bad mode in Error handler detected on CPU0, code 0xbe000000 -- SError
[   15.568501] Internal error: Oops - bad mode: 0 [#1] PREEMPT SMP
[   15.568526] Modules linked in: nvgpu spidev tcp_bbr atkbd libps2 macvlan ip_tables x_tables
[   15.568560] CPU: 0 PID: 3390 Comm: systemd-udevd Not tainted 4.9.140-tegra #1
[   15.568562] ras_ccplex_serr_callback: Scanning CCPLEX Error Records for Uncorrectable Errors
[   15.568563] Hardware name: NVIDIA Jetson Xavier NX Developer Kit (DT)
[   15.568568] task: ffffffc1f4571c00 task.stack: ffffffc1f3e50000
[   15.568572] PC is at 0x7f8a607c64
[   15.568574] LR is at 0x7f8a607c64
[   15.568578] pc : [<0000007f8a607c64>] lr : [<0000007f8a607c64>] pstate: 00000000
[   15.568579] sp : 0000007fecbcbfa0
[   15.568591] x29: 0000007fecbcbfa0
[   15.568591] ras_corecluster_serr_callback:Scanning CoreCluster Error Records for Uncorrectable Errors
[   15.568598] x28: 000000555f104b30
x27: 0000007fecbcc044 x26: 0000000000000000
[   15.568622] x25: 000000000aba9500 x24: 000000555f0ca910
[   15.568633] x23: 0000000000000009 x22: 000000555f0d0990
[   15.568633] ras_core_serr_callback: Scanning Core Error Records for Uncorrectable Errors
[   15.568649] x21: 000000555f0d0ab0 x20: 0000000000000000
[   15.568665] x19: 000000555f0fd470 x18: 0000000000000000
[   15.568676] x17: 0000007f8a2b8080 x16: 0000007f8a706160
[   15.568688] x15: 766564752d357378 x14: 34386e6339666a66
[   15.568721] x13: 0000000000000001 x12: 0000000000004570
[   15.568727] x11: 0000000000000004 x10: 0101010101010101
[   15.568733] x9 : 0000000000000000 x8 : 000000000aba9500
[   15.568739] x7 : 0000002ad5641721 x6 : 0000007f8a73a668
[   15.568745] x5 : 0000000000000000 x4 : 0000000000000009
[   15.568751] x3 : 000000555f0ca910 x2 : 0000000000000000
[   15.568757] x1 : 9eee7d71e8236d00 x0 : 0000000000000000
[   15.568758]
[   15.568765] Process systemd-udevd (pid: 3390, stack limit = 0xffffffc1f3e50000)
[   15.568777] ---[ end trace 681eb15a7d93f758 ]---
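The numbers in the report above can be decoded by hand as a sanity check. The sketch below assumes the Armv8 `ESR_ELx` layout (exception class in bits [31:26]) and the Arm RAS `ERR<n>STATUS` layout (UE in bit 29, IERR in bits [15:8], SERR in bits [7:0]). The function names are made up for illustration. With the values from this log, it confirms that the exception class is an SError interrupt (0x2f) and that the printed IERR/SERR come straight out of `Status = 0xf400a20d`.

```shell
#!/bin/sh
# Hypothetical helpers to decode the SError/RAS values printed in the panic.
# Assumes the Armv8 ESR_ELx and Arm RAS ERR<n>STATUS register layouts.

# Exception Class, ESR bits [31:26]. 0x2f means "SError interrupt".
esr_ec() {
    printf '0x%x\n' $(( ($1 >> 26) & 0x3f ))
}

# SERR, ERR<n>STATUS bits [7:0]: architecturally defined error code.
status_serr() {
    printf '0x%x\n' $(( $1 & 0xff ))
}

# IERR, ERR<n>STATUS bits [15:8]: implementation-defined error code.
status_ierr() {
    printf '0x%x\n' $(( ($1 >> 8) & 0xff ))
}

# UE, ERR<n>STATUS bit 29: prints 1 if the error is uncorrected.
status_ue() {
    echo $(( ($1 >> 29) & 1 ))
}
```

For example, `esr_ec 0xbe000000` prints `0x2f`, and `status_serr 0xf400a20d` prints `0xd`, matching the "Illegal address (software fault)" line.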

Is kexec supported by NVIDIA's L4T kernel?

Hi,

Can you share the exact command/parameters you use?

sudo kexec -l /mnt/sd/boot/Image --initrd=/mnt/sd/initrd --append="video=tegrafb no_console_suspend=1 earlycon=tegra_comb_uart,mmio32,0x0c168000 gpt usbcore.old_scheme_first=1 tegraid=19.1.2.0.0 maxcpus=6 boot.slot_suffix= boot.ratchetvalues=0.4.2 vpr_resize sdhci_tegra.en_boot_part_access=1 root=/dev/mmcblk1p1 rw rootwait rootfstype=ext4 init=/nix/store/iars30adqhnqqac45x0rd6c2qz4dzs5w-nixos-system-xavier-nx-quark-20.09post-git/init console=ttyTCU0,115200n8 loglevel=7 fbcon=map:0 net.ifnames=0"
sudo kexec -e --dtb=/mnt/sd/boot/tegra194-xavier-nx-cti-NGX004.dtb
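For reference, here is the same two-step boot wrapped in a small function. One assumption worth flagging: kexec-tools documents `--dtb` as an arm64 option of the load step, so it most likely only takes effect on `kexec -l`, not on `kexec -e`. The sketch therefore passes it while loading. `kexec_boot` and the `KEXEC_DRY_RUN` variable are hypothetical names, and the real commands need root.

```shell
#!/bin/sh
# Sketch: stage and boot another kernel with kexec-tools on arm64.
# Assumption: --dtb belongs to the load (-l) step, not the exec (-e) step.
# KEXEC_DRY_RUN=1 (hypothetical) prints the commands instead of running them.

run() {
    if [ "${KEXEC_DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Usage: kexec_boot KERNEL INITRD DTB "CMDLINE"
kexec_boot() {
    kernel=$1 initrd=$2 dtb=$3 cmdline=$4
    # Stage the new kernel, initrd, device tree and command line.
    run kexec -l "$kernel" --initrd="$initrd" --dtb="$dtb" --append="$cmdline" || return 1
    # Jump into the staged kernel; does not return if it succeeds.
    run kexec -e
}
```

Running it with `KEXEC_DRY_RUN=1` first makes it easy to check the exact command line before actually jumping.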

This error happens right after the new kernel loads (insmod) the nvgpu module, so maybe the problem is there.

Solved it: I need to disable the nvgpu kernel module in the first OS. Otherwise, when the kexec'd kernel tries to insmod it again, the load fails and the kernel panics.
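A minimal guard along the lines of this fix: check `/proc/modules` and refuse to run `kexec -e` while nvgpu is still loaded in the first OS. The function names are hypothetical. How you disable nvgpu is also an assumption. A `blacklist nvgpu` line in a file under `/etc/modprobe.d/` only stops alias-based autoloading, while `install nvgpu /bin/false` also blocks explicit `modprobe nvgpu`. The disabled state presumably has to be in effect from boot of the first OS, because the module may not unload cleanly once the GPU is in use.

```shell
#!/bin/sh
# Sketch: only kexec into the other OS if nvgpu is not loaded here.

# Succeeds if module $1 is listed in $2 (defaults to /proc/modules).
# /proc/modules lines start with "<name> <size> ...".
module_loaded() {
    grep -q "^$1 " "${2:-/proc/modules}"
}

# Refuse to jump into the staged kernel while nvgpu is still loaded.
# Optional $1 overrides the modules file (useful for testing).
safe_kexec_exec() {
    if module_loaded nvgpu "$1"; then
        echo "nvgpu is loaded; disable it in this OS and reboot before kexec" >&2
        return 1
    fi
    kexec -e
}
```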
