Kernel panic when starting a CUDA application as a service

gavin.lofts · June 14, 2021, 3:39pm

I am using the Jetson TX2 development kit and L4T 32 (I am using L4T 32.4, but I have also tried my test case on L4T 32.5.1).

I found that if I enable a systemd service to run my CUDA application, I get a kernel panic on bootup. I have included the kernel log below. If I add a 15s delay, everything works as expected.

I have made a test case to reproduce this problem with a CUDA sample to (hopefully) simplify understanding:

flash dev kit with L4T 32.5.1
install cuda-samples-10-2
make samples/1_Utilities/deviceQuery
Write this systemd file:
/etc/systemd/system/devicequery.service:

[Unit]
Description=CUDA test

[Service]
ExecStart=/usr/local/cuda-10.2/samples/1_Utilities/deviceQuery/deviceQuery
User=minit
Group=minit

[Install]
WantedBy=multi-user.target

Enable the service: sudo systemctl enable devicequery
Reboot

If I add a 15s delay by adding this line to my service:
ExecStartPre=/bin/sleep 15
everything works as expected.
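For reference, this is a sketch of how the unit looks with the delay workaround applied. It uses only the values from the unit file in this post; the comments are mine. The 15s figure is simply what worked on my board, not a documented requirement.

```ini
# /etc/systemd/system/devicequery.service (with the delay workaround)
[Unit]
Description=CUDA test

[Service]
# Workaround: sleep before the CUDA binary first opens the GPU device.
# ExecStartPre= commands must finish before ExecStart= is run.
ExecStartPre=/bin/sleep 15
ExecStart=/usr/local/cuda-10.2/samples/1_Utilities/deviceQuery/deviceQuery
User=minit
Group=minit

[Install]
WantedBy=multi-user.target
```

The sleep only hides the ordering problem: it does not wait for any particular driver or service to be ready.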

Kernel log

[    3.362574] podgov: can't create debugfs directory
[    3.367919] Kernel panic - not syncing: nvhost_scale_emc_debug_init
[    3.374186] CPU: 5 PID: 4575 Comm: deviceQuery Not tainted 4.9.201-tegra #1
[    3.381136] Hardware name: quill (DT)
[    3.384789] Call trace:
[    3.387236] [<ffffff800808b9f8>] dump_backtrace+0x0/0x198
[    3.392627] [<ffffff800808bfbc>] show_stack+0x24/0x30
[    3.397671] [<ffffff800845abe8>] dump_stack+0xa0/0xc8
[    3.402715] [<ffffff80081c0a00>] panic+0x12c/0x2a8
[    3.407510] [<ffffff8008cbe228>] nvhost_scale_emc_debug_init.isra.12+0x128/0x1a0
[    3.414897] [<ffffff8008cbe5d4>] nvhost_pod_event_handler+0x334/0x400
[    3.421327] [<ffffff8008cbb4fc>] devfreq_add_device+0x284/0x408
[    3.427237] [<ffffff8008cbb6e4>] devm_devfreq_add_device+0x64/0xc0
[    3.433588] [<ffffff8000fc1080>] gk20a_scale_init+0xf0/0x190 [nvgpu]
[    3.440112] [<ffffff8000fba2f8>] gk20a_pm_finalize_poweron+0x370/0x400 [nvgpu]
[    3.447505] [<ffffff8000fba540>] gk20a_busy+0x1b8/0x4f0 [nvgpu]
[    3.453421] [<ffffff800878becc>] pm_generic_runtime_resume+0x3c/0x58
[    3.459765] [<ffffff800878e214>] __rpm_callback+0x74/0xa0
[    3.465154] [<ffffff800878e274>] rpm_callback+0x34/0x98
[    3.470369] [<ffffff800878f710>] rpm_resume+0x470/0x710
[    3.475585] [<ffffff800878f9fc>] __pm_runtime_resume+0x4c/0x70
[    3.481585] [<ffffff8000fba45c>] gk20a_busy+0xd4/0x4f0 [nvgpu]
[    3.487589] [<ffffff8000f9bf1c>] gk20a_ctrl_dev_open+0x8c/0x168 [nvgpu]
[    3.494196] [<ffffff8008261f6c>] chrdev_open+0x94/0x198
[    3.499413] [<ffffff8008258918>] do_dentry_open+0x1d8/0x340
[    3.504977] [<ffffff8008259ed0>] vfs_open+0x58/0x88
[    3.509847] [<ffffff800826d3b0>] do_last+0x530/0xfd0
[    3.514806] [<ffffff800826dee0>] path_openat+0x90/0x378
[    3.520021] [<ffffff800826f450>] do_filp_open+0x70/0xe8
[    3.525237] [<ffffff800825a394>] do_sys_open+0x174/0x258
[    3.530540] [<ffffff800825a4fc>] SyS_openat+0x3c/0x50
[    3.535589] [<ffffff8008083900>] el0_svc_naked+0x34/0x38
[    3.540893] SMP: stopping secondary CPUs
[    3.544818] Kernel Offset: disabled
[    3.548298] Memory Limit: none
[    3.551346] trusty-log panic notifier - trusty version Built: 08:40:58 Feb 19 2021
[    3.563195] Rebooting in 5 seconds..

Could you tell me what I need to wait for to prevent this panic?

gavin.lofts · June 23, 2021, 6:39am

Hello kayccc,

This is still an issue for me. I can work around it by delaying my application, but a 15s startup delay is not great. I posted here because this problem seems to be specific to the TX2 and the TX2 CUDA driver.

Thanks,

Gavin

neel_patel · June 23, 2021, 6:44am

Hi Gavin,

We are investigating the issue. I suspect that nvhost needs to finish initialising first and that's what is causing this, but I need to confirm.

gavin.lofts · June 23, 2021, 6:49am

Thank you neel_patel.

neel_patel · June 23, 2021, 7:29pm

Hi Gavin,
Can you share the complete log? That will help us identify the branch you are using and any instrumentation you have added.

Thanks

gavin.lofts · June 24, 2021, 1:01pm

The L4T is unchanged from here: https://developer.nvidia.com/embedded/linux-tegra

Here’s the log:

SoC: tegra186
Model: NVIDIA P2771-0000-500
Board: NVIDIA P2771-0000
DRAM:  7.8 GiB
MMC:   sdhci@3400000: 1, sdhci@3460000: 0
Loading Environment from MMC... *** Warning - bad CRC, using default environment

In:    serial
Out:   serial
Err:   serial
Net:   
Warning: ethernet@2490000 using MAC address from ROM
eth0: ethernet@2490000
Hit any key to stop autoboot:  0 
MMC: no card present
switch to partitions #0, OK
mmc0(part 0) is current device
Scanning mmc 0:1...
Found /boot/extlinux/extlinux.conf
Retrieving file: /boot/extlinux/extlinux.conf
858 bytes read in 24 ms (34.2 KiB/s)
1:	primary kernel
Retrieving file: /boot/initrd
7236840 bytes read in 186 ms (37.1 MiB/s)
Retrieving file: /boot/Image
34338824 bytes read in 824 ms (39.7 MiB/s)
append: console=ttyS0,115200 androidboot.presilicon=true firmware_class.path=/etc/firmware root=/dev/mmcblk0p1 rw rootwait rootfstype=ext4 console=ttyS0,115200n8 console=tty0 fbcon=map:0 net.ifnames=0 isolcpus=1-2  video=tegrafb no_console_suspend=1 earlycon=uart8250,mmio32,0x3100000 nvdumper_reserved=0x2772e0000 gpt rootfs.slot_suffix= usbcore.old_scheme_first=1 tegraid=18.1.2.0.0 maxcpus=6 boot.slot_suffix= boot.ratchetvalues=0.2031647.1 vpr_resize bl_prof_dataptr=0x10000@0x275840000 sdhci_tegra.en_boot_part_access=1 quiet root=/dev/mmcblk0p1 rw rootwait rootfstype=ext4 console=ttyS0,115200n8 console=tty0 fbcon=map:0 net.ifnames=0 isolcpus=1-2 
## Flattened Device Tree blob at 80000000
   Booting using the fdt blob at 0x80000000
ERROR: reserving fdt memory region failed (addr=0 size=0)
ERROR: reserving fdt memory region failed (addr=0 size=0)
ERROR: reserving fdt memory region failed (addr=0 size=0)
   Using Device Tree in place at 0000000080000000, end 0000000080060699
copying carveout for /host1x@13e00000/display-hub@15200000/display@15200000...
copying carveout for /host1x@13e00000/display-hub@15200000/display@15210000...
copying carveout for /host1x@13e00000/display-hub@15200000/display@15220000...

Starting kernel ...

[    0.000000] Booting Linux on physical CPU 0x100
[    0.000000] Linux version 4.9.201-tegra (buildbrain@mobile-u64-5285-d7000) (gcc version 7.3.1 20180425 [linaro-7.3-2018.05 revision d29120a424ecfbc167ef90065c0eeb7f91977701] (Linaro GCC 7.3-2018.05) ) #1 SMP PREEMPT Fri Feb 19 08:42:04 PST 2021
[    0.000000] Boot CPU: AArch64 Processor [411fd073]
[    0.000000] OF: fdt:memory scan node memory@80000000, reg size 80,
[    0.000000] OF: fdt: - 80000000 ,  70000000
[    0.000000] OF: fdt: - f0200000 ,  185600000
[    0.000000] OF: fdt: - 275e00000 ,  200000
[    0.000000] OF: fdt: - 276600000 ,  200000
[    0.000000] OF: fdt: - 277000000 ,  200000
[    0.000000] earlycon: uart8250 at MMIO32 0x0000000003100000 (options '')
[    0.000000] bootconsole [uart8250] enabled
[    1.923721] cgroup: cgroup2: unknown option "nsdelegate"
[    3.148498] using random self ethernet address
[    3.181618] using random host ethernet address
[    3.392964] podgov: can't create debugfs directory
[    3.407877] Kernel panic - not syncing: nvhost_scale_emc_debug_init
[    3.412642] random: crng init done
[    3.412644] random: 7 urandom warning(s) missed due to ratelimiting
[    3.423790] CPU: 0 PID: 4767 Comm: deviceQuery Not tainted 4.9.201-tegra #1
[    3.430740] Hardware name: quill (DT)
[    3.434394] Call trace:
[    3.436848] [<ffffff800808b9f8>] dump_backtrace+0x0/0x198
[    3.436854] [<ffffff800808bfbc>] show_stack+0x24/0x30
[    3.436858] [<ffffff800845abe8>] dump_stack+0xa0/0xc8
[    3.436863] [<ffffff80081c0a00>] panic+0x12c/0x2a8
[    3.436869] [<ffffff8008cbe228>] nvhost_scale_emc_debug_init.isra.12+0x128/0x1a0
[    3.436872] [<ffffff8008cbe5d4>] nvhost_pod_event_handler+0x334/0x400
[    3.436874] [<ffffff8008cbb4fc>] devfreq_add_device+0x284/0x408
[    3.436876] [<ffffff8008cbb6e4>] devm_devfreq_add_device+0x64/0xc0
[    3.437063] [<ffffff8000fc1080>] gk20a_scale_init+0xf0/0x190 [nvgpu]
[    3.437233] [<ffffff8000fba2f8>] gk20a_pm_finalize_poweron+0x370/0x400 [nvgpu]
[    3.437393] [<ffffff8000fba540>] gk20a_busy+0x1b8/0x4f0 [nvgpu]
[    3.437398] [<ffffff800878becc>] pm_generic_runtime_resume+0x3c/0x58
[    3.437401] [<ffffff800878e214>] __rpm_callback+0x74/0xa0
[    3.437402] [<ffffff800878e274>] rpm_callback+0x34/0x98
[    3.437404] [<ffffff800878f710>] rpm_resume+0x470/0x710
[    3.437405] [<ffffff800878f9fc>] __pm_runtime_resume+0x4c/0x70
[    3.437570] [<ffffff8000fba45c>] gk20a_busy+0xd4/0x4f0 [nvgpu]
[    3.437724] [<ffffff8000f9bf1c>] gk20a_ctrl_dev_open+0x8c/0x168 [nvgpu]
[    3.437727] [<ffffff8008261f6c>] chrdev_open+0x94/0x198
[    3.437730] [<ffffff8008258918>] do_dentry_open+0x1d8/0x340
[    3.437732] [<ffffff8008259ed0>] vfs_open+0x58/0x88
[    3.437735] [<ffffff800826d3b0>] do_last+0x530/0xfd0
[    3.437737] [<ffffff800826dee0>] path_openat+0x90/0x378
[    3.437740] [<ffffff800826f450>] do_filp_open+0x70/0xe8
[    3.437741] [<ffffff800825a394>] do_sys_open+0x174/0x258
[    3.437744] [<ffffff800825a4fc>] SyS_openat+0x3c/0x50
[    3.437746] [<ffffff8008083900>] el0_svc_naked+0x34/0x38
[    3.437751] SMP: stopping secondary CPUs
[    3.442246] Kernel Offset: disabled
[    3.442247] Memory Limit: none
[    3.604271] trusty-log panic notifier - trusty version Built: 08:40:58 Feb 19 2021 
[    3.604271] Rebooting in 5 seconds..

Thanks!

gavin.lofts · June 24, 2021, 1:02pm

One further piece of information that could help: if we order our service after nvpmodel.service by adding After=nvpmodel.service to the [Unit] section, there is no kernel panic.
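A sketch of the test unit with that ordering applied is shown here. Everything except the After= line is unchanged from the unit in the first post. It assumes nvpmodel.service is already enabled on the image, as it was on my setup.

```ini
# /etc/systemd/system/devicequery.service (ordered after nvpmodel.service)
[Unit]
Description=CUDA test
# Do not start this unit until nvpmodel.service has been started.
After=nvpmodel.service

[Service]
ExecStart=/usr/local/cuda-10.2/samples/1_Utilities/deviceQuery/deviceQuery
User=minit
Group=minit

[Install]
WantedBy=multi-user.target
```

Note that After= only controls ordering. It does not pull nvpmodel.service into the boot if nothing else starts it. Adding Wants=nvpmodel.service would do that, but I have not tested that variant.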

neel_patel · June 25, 2021, 9:14pm

Thank you, Gavin, for the log and info. We are investigating this.

neel_patel · July 8, 2021, 6:50pm

Hi Gavin,

These logs are suppressed. You need to boot with ignore_loglevel on the kernel command line, which is set in this file in the BSP: rootfs/boot/extlinux/extlinux.conf

Alternatively, can you tell us what changes were made to produce this crash? We are unable to reproduce it right now.

gavin.lofts · July 9, 2021, 8:43am

Thank you for looking into this. I made no changes to create the crash; I only followed the instructions at the top of this post. I guess the module revision may make a difference due to DRAM timing. I have only seen this problem on D00 modules so far, but I haven’t tested other revisions.

I have taken down the rig that reproduced this problem, and I don’t think I will be able to put it back together.

We read this in system/nvzramconfig.service:
“TPC power gating must be enabled before anything touching gpu”
After=nvpmodel.service

I have accepted the above comment as the explanation for what I saw.

system · September 12, 2021, 6:09am

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.