Kexec triggers Kernel oops on boot

I tried this on a Xavier NX Devkit (SD card), but I'm pretty sure it also happens with the production module and with the AGX.

If I specify the dtb, I get this kernel panic after kexec:

sudo kexec -l /boot/Image --dtb=/boot/tegra194-p3668-all-p3509-0000.dtb --reuse-cmdline --force

WARNING: at platform/drivers/pg/pg-gpu-t194.c:185
[  284.412661] kexec_core: Starting new kernel
[  284.451055] CPU1: shutdown
[    0.000000] Booting Linux on physical CPU 0x0
[    0.000000] Linux version 4.9.140-tegra (buildbrain@mobile-u64-3357) (gcc version 7.3.1 20180425 [linaro-7.3-2018.05 revision d29120a424ecfbc167ef90065c0eeb7f91977701] (Linaro GCC 7.3-2018.05) ) #1 SMP PREEMPT Thu Jun 25 21:22:12 PDT 2020
[    0.000000] Boot CPU: AArch64 Processor [4e0f0040]
[    0.000000] earlycon: tegra_comb_uart0 at MMIO32 0x000000000c168000 (options '')
[    0.000000] bootconsole [tegra_comb_uart0] enabled
[    0.000000] cma: Failed to reserve 64 MiB
[    0.000000] Kernel panic - not syncing: ERROR: Failed to allocate 0x1000 bytes below 0x0.
[    0.000000] 
[    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.9.140-tegra #1
[    0.000000] Hardware name: NVIDIA Jetson Xavier NX Developer Kit (DT)
[    0.000000] Call trace:
[    0.000000] [<ffffff800808bdb8>] dump_backtrace+0x0/0x198
[    0.000000] [<ffffff800808c37c>] show_stack+0x24/0x30
[    0.000000] [<ffffff800845c7a0>] dump_stack+0x98/0xc0
[    0.000000] [<ffffff80081c1438>] panic+0x11c/0x298
[    0.000000] [<ffffff8009618268>] memblock_alloc_base+0x30/0x3c
[    0.000000] [<ffffff8009618284>] memblock_alloc+0x10/0x18
[    0.000000] [<ffffff8009606fc4>] early_pgtable_alloc+0x18/0x70
[    0.000000] [<ffffff8009607198>] paging_init+0x2c/0x7a4
[    0.000000] [<ffffff8009603f40>] setup_arch+0x204/0x604
[    0.000000] [<ffffff8009600858>] start_kernel+0x64/0x384
[    0.000000] [<ffffff8009600204>] __primary_switched+0x80/0x94

And if I reuse the dtb that cboot provided for the current boot, the kernel loads fine but panics when nvgpu gets loaded:

sudo kexec -l /boot/Image  --reuse-cmdline --force

[    7.991162] EXT4-fs (mmcblk0p1): re-mounted. Opts: (null)
[    8.076322] nvgpu: 17000000.gv11b          nvgpu_nvhost_syncpt_init:291  [INFO]  syncpt_unit_base 60000000 syncpt_unit_size 400000 size 1000
[    8.076322] 
[    8.083720] CPU0: SError detected, daif=140, spsr=0x80000000, mpidr=80000000, esr=be000000
[    8.083725] CPU1: SError detected, daif=1c0, spsr=0x60c000c5, mpidr=80000001, esr=be000000
[    8.083730] CPU5: SError detected, daif=140, spsr=0x80400045, mpidr=80000201, esr=be000000
[    8.083735] CPU4: SError detected, daif=140, spsr=0x60400045, mpidr=80000200, esr=be000000
[    8.083742] CPU2: SError detected, daif=140, spsr=0x80c00045, mpidr=80000100, esr=be000000
[    8.083746] CPU3: SError detected, daif=140, spsr=0x20000000, mpidr=80000101, esr=be000000
[    8.083807] ras_ccplex_serr_callback: Scanning CCPLEX Error Records for Uncorrectable Errors
[    8.083828] **************************************
[    8.083830] RAS Error in SCF:SNOC, ERRSELR_EL1=1026:
[    8.083832]  Status = 0xfc00a20d
[    8.083834]  IERR = Uncorrectable Carveout  Error: 0xa2
[    8.083836]  SERR = Illegal address (software fault): 0xd
[    8.083837]  Overflow (there may be more errors) - Uncorrectable
[    8.083838]  Uncorrectable (this is fatal)
[    8.083845]  MISC0 = 0x804
[    8.083847]  MISC1 = 0xa10900000000
[    8.083852]  ADDR = 0x80000000c6000000
[    8.083858] **************************************
[    8.083865] ras_corecluster_serr_callback:Scanning CoreCluster Error Records for Uncorrectable Errors
[    8.083905] ras_core_serr_callback: Scanning Core Error Records for Uncorrectable Errors
[    8.083985] Bad mode in Error handler detected on CPU1, code 0xbe000000 -- SError
[    8.083989] Internal error: Oops - bad mode: 0 [#1] PREEMPT SMP
[    8.084003] Modules linked in: nvgpu bluedroid_pm ip_tables x_tables
[    8.084012] CPU: 1 PID: 347 Comm: kworker/u12:5 Not tainted 4.9.140-tegra #1
[    8.084014] Hardware name: NVIDIA Jetson Xavier NX Developer Kit (DT)
[    8.084029] Workqueue: events_unbound call_usermodehelper_exec_work
[    8.084032] task: ffffffc1f4c4aa00 task.stack: ffffffc1f4d58000
[    8.084040] PC is at bad_range+0x28/0x70
[    8.084043] LR is at bad_range+0x28/0x70
[    8.084046] pc : [<ffffff80081c9d48>] lr : [<ffffff80081c9d48>] pstate: 60c000c5
[    8.084048] sp : ffffffc1f4d5b6b0
[    8.084053] x29: ffffffc1f4d5b6b0 x28: 0000000000000008 
[    8.084055] ras_ccplex_serr_callback: Scanning CCPLEX Error Records for Uncorrectable Errors
[    8.084061] x27: 00000000ffffff80 x26: 0000000000000003 
[    8.084065] x25: ffffffbf07b83e00 x24: ffffff800a08f1b8 
[    8.084072] x23: 0000000000000000 x22: ffffffbf07b83c20 
[    8.084076] x21: ffffffbf07b83c00 x20: ffffffbf07b83e00 
[    8.084081] x19: ffffff800a08efc0 x18: 0000000000000000 
[    8.084082] ras_corecluster_serr_callback:Scanning CoreCluster Error Records for Uncorrectable Errors
[    8.084088] x17: 000000000000000e x16: 0000000000000000 
[    8.084092] x15: 0000000000000000 x14: 00000000000c8000 
[    8.084097] x13: 0000000000006db7 x12: 0000000000006db7 
[    8.084101] x11: ffffffffffffffff x10: ffffffffffffffff 
[    8.084107] x9 : 0000000000000000 x8 : ffffff800a08f1e8 
[    8.084112] x7 : 0000000000000000 x6 : 0000000000000000 
[    8.084116] x5 : 0000000000180000 x4 : 0000000000100000 
[    8.084121] x3 : 0000000000000001 x2 : 0000000000000000 
[    8.084124] x1 : 000000000026e0f8 
[    8.084125] ras_core_serr_callback: Scanning Core Error Records for Uncorrectable Errors
[    8.084128] x0 : 0000000000000000 

[    8.084133] Process kworker/u12:5 (pid: 347, stack limit = 0xffffffc1f4d58000)
[    8.084136] Call trace:
[    8.084139] [<ffffff80081c9d48>] bad_range+0x28/0x70
[    8.084145] [<ffffff80081cc608>] __rmqueue+0x118/0x718
[    8.084148] [<ffffff80081cdd88>] get_page_from_freelist+0x770/0xa58
[    8.084152] [<ffffff80081ce8e4>] __alloc_pages_nodemask+0xfc/0xd38
[    8.084158] [<ffffff800822f260>] allocate_slab+0xa8/0x4e8
[    8.084161] [<ffffff800822f6e8>] new_slab+0x48/0x88
[    8.084165] [<ffffff8008231a8c>] ___slab_alloc.constprop.34+0x2bc/0x4a0
[    8.084169] [<ffffff8008231cb8>] __slab_alloc.isra.27.constprop.33+0x48/0x60
[    8.084175] [<ffffff8008231f58>] kmem_cache_alloc+0x288/0x2c0
[    8.084181] [<ffffff80080b0d74>] copy_process.isra.5.part.6+0x3e4/0x1530
[    8.084184] [<ffffff80080b205c>] _do_fork+0xd4/0x460
[    8.084188] [<ffffff80080b2490>] kernel_thread+0x48/0x58
[    8.084191] [<ffffff80080d1344>] call_usermodehelper_exec_work+0x34/0xd0
[    8.084196] [<ffffff80080d4ebc>] process_one_work+0x1e4/0x4b0
[    8.084200] [<ffffff80080d51d8>] worker_thread+0x50/0x4c8
[    8.084204] [<ffffff80080dbe64>] kthread+0xec/0xf0
[    8.084209] [<ffffff80080838a0>] ret_from_fork+0x10/0x30
[    8.084212] CPU5: SError detected, daif=140, spsr=0x80400045, mpidr=80000201, esr=be000000
[    8.084215] ---[ end trace b72d14ba5a5ce893 ]---
[    8.085562] ras_ccplex_serr_callback: Scanning CCPLEX Error Records for Uncorrectable Errors
[    8.085584] ras_corecluster_serr_callback:Scanning CoreCluster Error Records for Uncorrectable Errors
[    8.085625] ras_core_serr_callback: Scanning Core Error Records for Uncorrectable Errors
[    8.085766] ras_ccplex_serr_callback: Scanning CCPLEX Error Records for Uncorrectable Errors
[    8.085790] ras_corecluster_serr_callback:Scanning CoreCluster Error Records for Uncorrectable Errors
[    8.085832] ras_core_serr_callback: Scanning Core Error Records for Uncorrectable Errors
[    8.085970] ras_ccplex_serr_callback: Scanning CCPLEX Error Records for Uncorrectable Errors
[    8.085992] ras_corecluster_serr_callback:Scanning CoreCluster Error Records for Uncorrectable Errors
[    8.086032] ras_core_serr_callback: Scanning Core Error Records for Uncorrectable Errors
[    8.086115] CPU4: SError detected, daif=1c0, spsr=0xa0c000c5, mpidr=80000200, esr=be000000
[    8.086120] CPU2: SError detected, daif=140, spsr=0x80c00045, mpidr=80000100, esr=be000000
[    8.086179] CPU3: SError detected, daif=140, spsr=0x40400045, mpidr=80000101, esr=be000000
[    8.086208] ras_ccplex_serr_callback: Scanning CCPLEX Error Records for Uncorrectable Errors
[    8.086226] **************************************
[    8.086228] RAS Error in SCF:SNOC, ERRSELR_EL1=1026:
[    8.086230]  Status = 0xfc00a20d
[    8.086232]  IERR = Uncorrectable Carveout  Error: 0xa2
[    8.086234]  SERR = Illegal address (software fault): 0xd
[    8.086236]  Overflow (there may be more errors) - Uncorrectable
[    8.086237]  Uncorrectable (this is fatal)
[    8.086243]  MISC0 = 0x804
[    8.086245]  MISC1 = 0x3a10900000000
[    8.086249]  ADDR = 0x80000000c6000080
[    8.086254] **************************************

I was expecting a dtb generated by jetson-io, for instance tegra194-p3668-all-p3509-0000-adafruit-sph0645lm4h.dtb, to work when loaded with kexec, but it looks like it doesn't.

It would be extremely helpful to have these issues fixed so we can have kexec working. Thank you.

Hi,

Sorry, but we may not support kexec. You can try an alternative method to replace the Image and dtb.

I have a fix here for the second error that you are running into.

Would it be possible to share a link to the fix that works for you, @rdesai1? Thank you.

I was able to do it in a custom OS, so I'm not sure how to do it with NVIDIA's Ubuntu. Basically, nvgpu can't be loaded by the first OS booted by the bootloader, or else the second OS that you kexec into will panic. This looks like a bug in the nvgpu module: it doesn't clean up after itself correctly. Who knows if that will ever be fixed.
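
As a rough sketch of how one might guard against this before calling kexec, the helper below checks whether nvgpu is already loaded in the running (first) kernel. The function name module_loaded and its optional file argument are made up for illustration; it only reads /proc/modules and does not unload or fix anything.

```shell
#!/bin/sh
# Hypothetical helper (not part of kexec-tools or L4T).
# module_loaded NAME [FILE]
# Returns 0 if NAME appears as a module name in FILE (default /proc/modules).
module_loaded() {
    modules_file="${2:-/proc/modules}"
    # Each line of /proc/modules starts with the module name as its first field.
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "$modules_file"
}
```

It could be used as `module_loaded nvgpu && echo "nvgpu is loaded, kexec will probably hit the SError" >&2` right before the kexec command. How to keep nvgpu from loading in the first OS on the stock image is still an open question.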