Kernel panic when starting a CUDA application as a service

I am using the Jetson TX2 development kit and L4T 32 (I am using L4T 32.4, but I have also tried my test case on L4T 32.5.1).

I found that if I enable a systemd service to run my CUDA application, I get a kernel panic on bootup (kernel log included below). If I add a 15s delay, everything works as expected.

I have made a test case to reproduce this problem with a CUDA sample to (hopefully) simplify understanding:

  1. flash dev kit with L4T 32.5.1
  2. install cuda-samples-10-2
  3. make samples/1_Utilities/deviceQuery
  4. Write this systemd file:
    /etc/systemd/system/devicequery.service:
[Unit]
Description=CUDA test

[Service]
ExecStart=/usr/local/cuda-10.2/samples/1_Utilities/deviceQuery/deviceQuery
User=minit
Group=minit

[Install]
WantedBy=multi-user.target
  5. Enable the service: sudo systemctl enable devicequery
  6. reboot

If I add a 15s delay by adding this line to my service:
ExecStartPre=/bin/sleep 15
everything works as expected.

Kernel log

[    3.362574] podgov: can't create debugfs directory
[    3.367919] Kernel panic - not syncing: nvhost_scale_emc_debug_init
[    3.374186] CPU: 5 PID: 4575 Comm: deviceQuery Not tainted 4.9.201-tegra #1
[    3.381136] Hardware name: quill (DT)
[    3.384789] Call trace:
[    3.387236] [<ffffff800808b9f8>] dump_backtrace+0x0/0x198
[    3.392627] [<ffffff800808bfbc>] show_stack+0x24/0x30
[    3.397671] [<ffffff800845abe8>] dump_stack+0xa0/0xc8
[    3.402715] [<ffffff80081c0a00>] panic+0x12c/0x2a8
[    3.407510] [<ffffff8008cbe228>] nvhost_scale_emc_debug_init.isra.12+0x128/0x1a0
[    3.414897] [<ffffff8008cbe5d4>] nvhost_pod_event_handler+0x334/0x400
[    3.421327] [<ffffff8008cbb4fc>] devfreq_add_device+0x284/0x408
[    3.427237] [<ffffff8008cbb6e4>] devm_devfreq_add_device+0x64/0xc0
[    3.433588] [<ffffff8000fc1080>] gk20a_scale_init+0xf0/0x190 [nvgpu]
[    3.440112] [<ffffff8000fba2f8>] gk20a_pm_finalize_poweron+0x370/0x400 [nvgpu]
[    3.447505] [<ffffff8000fba540>] gk20a_busy+0x1b8/0x4f0 [nvgpu]
[    3.453421] [<ffffff800878becc>] pm_generic_runtime_resume+0x3c/0x58
[    3.459765] [<ffffff800878e214>] __rpm_callback+0x74/0xa0
[    3.465154] [<ffffff800878e274>] rpm_callback+0x34/0x98
[    3.470369] [<ffffff800878f710>] rpm_resume+0x470/0x710
[    3.475585] [<ffffff800878f9fc>] __pm_runtime_resume+0x4c/0x70
[    3.481585] [<ffffff8000fba45c>] gk20a_busy+0xd4/0x4f0 [nvgpu]
[    3.487589] [<ffffff8000f9bf1c>] gk20a_ctrl_dev_open+0x8c/0x168 [nvgpu]
[    3.494196] [<ffffff8008261f6c>] chrdev_open+0x94/0x198
[    3.499413] [<ffffff8008258918>] do_dentry_open+0x1d8/0x340
[    3.504977] [<ffffff8008259ed0>] vfs_open+0x58/0x88
[    3.509847] [<ffffff800826d3b0>] do_last+0x530/0xfd0
[    3.514806] [<ffffff800826dee0>] path_openat+0x90/0x378
[    3.520021] [<ffffff800826f450>] do_filp_open+0x70/0xe8
[    3.525237] [<ffffff800825a394>] do_sys_open+0x174/0x258
[    3.530540] [<ffffff800825a4fc>] SyS_openat+0x3c/0x50
[    3.535589] [<ffffff8008083900>] el0_svc_naked+0x34/0x38
[    3.540893] SMP: stopping secondary CPUs
[    3.544818] Kernel Offset: disabled
[    3.548298] Memory Limit: none
[    3.551346] trusty-log panic notifier - trusty version Built: 08:40:58 Feb 19 2021
[    3.563195] Rebooting in 5 seconds..

Could you tell me what I need to wait for to prevent this panic?

Hello kayccc,

This is still an issue for me. I can work around it by delaying my application, but a 15s startup delay is not great. I posted here because this problem seems to be specific to the TX2 and the TX2 CUDA driver.

Thanks,

Gavin

Hi Gavin,

We are investigating the issue. I suspect that nvhost needs to finish initialising first and that this is what causes the panic, but I need to confirm.

Thank you neel_patel.

Hi Gavin,
Can you share the complete log? That will help us identify the branch you are using and any instrumentation you have done.

Thanks

The L4T image is unmodified from the release here: L4T | NVIDIA Developer

Here’s the log:

SoC: tegra186
Model: NVIDIA P2771-0000-500
Board: NVIDIA P2771-0000
DRAM:  7.8 GiB
MMC:   sdhci@3400000: 1, sdhci@3460000: 0
Loading Environment from MMC... *** Warning - bad CRC, using default environment

In:    serial
Out:   serial
Err:   serial
Net:   
Warning: ethernet@2490000 using MAC address from ROM
eth0: ethernet@2490000
Hit any key to stop autoboot:  0 
MMC: no card present
switch to partitions #0, OK
mmc0(part 0) is current device
Scanning mmc 0:1...
Found /boot/extlinux/extlinux.conf
Retrieving file: /boot/extlinux/extlinux.conf
858 bytes read in 24 ms (34.2 KiB/s)
1:	primary kernel
Retrieving file: /boot/initrd
7236840 bytes read in 186 ms (37.1 MiB/s)
Retrieving file: /boot/Image
34338824 bytes read in 824 ms (39.7 MiB/s)
append: console=ttyS0,115200 androidboot.presilicon=true firmware_class.path=/etc/firmware root=/dev/mmcblk0p1 rw rootwait rootfstype=ext4 console=ttyS0,115200n8 console=tty0 fbcon=map:0 net.ifnames=0 isolcpus=1-2  video=tegrafb no_console_suspend=1 earlycon=uart8250,mmio32,0x3100000 nvdumper_reserved=0x2772e0000 gpt rootfs.slot_suffix= usbcore.old_scheme_first=1 tegraid=18.1.2.0.0 maxcpus=6 boot.slot_suffix= boot.ratchetvalues=0.2031647.1 vpr_resize bl_prof_dataptr=0x10000@0x275840000 sdhci_tegra.en_boot_part_access=1 quiet root=/dev/mmcblk0p1 rw rootwait rootfstype=ext4 console=ttyS0,115200n8 console=tty0 fbcon=map:0 net.ifnames=0 isolcpus=1-2 
## Flattened Device Tree blob at 80000000
   Booting using the fdt blob at 0x80000000
ERROR: reserving fdt memory region failed (addr=0 size=0)
ERROR: reserving fdt memory region failed (addr=0 size=0)
ERROR: reserving fdt memory region failed (addr=0 size=0)
   Using Device Tree in place at 0000000080000000, end 0000000080060699
copying carveout for /host1x@13e00000/display-hub@15200000/display@15200000...
copying carveout for /host1x@13e00000/display-hub@15200000/display@15210000...
copying carveout for /host1x@13e00000/display-hub@15200000/display@15220000...

Starting kernel ...

[    0.000000] Booting Linux on physical CPU 0x100
[    0.000000] Linux version 4.9.201-tegra (buildbrain@mobile-u64-5285-d7000) (gcc version 7.3.1 20180425 [linaro-7.3-2018.05 revision d29120a424ecfbc167ef90065c0eeb7f91977701] (Linaro GCC 7.3-2018.05) ) #1 SMP PREEMPT Fri Feb 19 08:42:04 PST 2021
[    0.000000] Boot CPU: AArch64 Processor [411fd073]
[    0.000000] OF: fdt:memory scan node memory@80000000, reg size 80,
[    0.000000] OF: fdt: - 80000000 ,  70000000
[    0.000000] OF: fdt: - f0200000 ,  185600000
[    0.000000] OF: fdt: - 275e00000 ,  200000
[    0.000000] OF: fdt: - 276600000 ,  200000
[    0.000000] OF: fdt: - 277000000 ,  200000
[    0.000000] earlycon: uart8250 at MMIO32 0x0000000003100000 (options '')
[    0.000000] bootconsole [uart8250] enabled
[    1.923721] cgroup: cgroup2: unknown option "nsdelegate"
[    3.148498] using random self ethernet address
[    3.181618] using random host ethernet address
[    3.392964] podgov: can't create debugfs directory
[    3.407877] Kernel panic - not syncing: nvhost_scale_emc_debug_init
[    3.412642] random: crng init done
[    3.412644] random: 7 urandom warning(s) missed due to ratelimiting
[    3.423790] CPU: 0 PID: 4767 Comm: deviceQuery Not tainted 4.9.201-tegra #1
[    3.430740] Hardware name: quill (DT)
[    3.434394] Call trace:
[    3.436848] [<ffffff800808b9f8>] dump_backtrace+0x0/0x198
[    3.436854] [<ffffff800808bfbc>] show_stack+0x24/0x30
[    3.436858] [<ffffff800845abe8>] dump_stack+0xa0/0xc8
[    3.436863] [<ffffff80081c0a00>] panic+0x12c/0x2a8
[    3.436869] [<ffffff8008cbe228>] nvhost_scale_emc_debug_init.isra.12+0x128/0x1a0
[    3.436872] [<ffffff8008cbe5d4>] nvhost_pod_event_handler+0x334/0x400
[    3.436874] [<ffffff8008cbb4fc>] devfreq_add_device+0x284/0x408
[    3.436876] [<ffffff8008cbb6e4>] devm_devfreq_add_device+0x64/0xc0
[    3.437063] [<ffffff8000fc1080>] gk20a_scale_init+0xf0/0x190 [nvgpu]
[    3.437233] [<ffffff8000fba2f8>] gk20a_pm_finalize_poweron+0x370/0x400 [nvgpu]
[    3.437393] [<ffffff8000fba540>] gk20a_busy+0x1b8/0x4f0 [nvgpu]
[    3.437398] [<ffffff800878becc>] pm_generic_runtime_resume+0x3c/0x58
[    3.437401] [<ffffff800878e214>] __rpm_callback+0x74/0xa0
[    3.437402] [<ffffff800878e274>] rpm_callback+0x34/0x98
[    3.437404] [<ffffff800878f710>] rpm_resume+0x470/0x710
[    3.437405] [<ffffff800878f9fc>] __pm_runtime_resume+0x4c/0x70
[    3.437570] [<ffffff8000fba45c>] gk20a_busy+0xd4/0x4f0 [nvgpu]
[    3.437724] [<ffffff8000f9bf1c>] gk20a_ctrl_dev_open+0x8c/0x168 [nvgpu]
[    3.437727] [<ffffff8008261f6c>] chrdev_open+0x94/0x198
[    3.437730] [<ffffff8008258918>] do_dentry_open+0x1d8/0x340
[    3.437732] [<ffffff8008259ed0>] vfs_open+0x58/0x88
[    3.437735] [<ffffff800826d3b0>] do_last+0x530/0xfd0
[    3.437737] [<ffffff800826dee0>] path_openat+0x90/0x378
[    3.437740] [<ffffff800826f450>] do_filp_open+0x70/0xe8
[    3.437741] [<ffffff800825a394>] do_sys_open+0x174/0x258
[    3.437744] [<ffffff800825a4fc>] SyS_openat+0x3c/0x50
[    3.437746] [<ffffff8008083900>] el0_svc_naked+0x34/0x38
[    3.437751] SMP: stopping secondary CPUs
[    3.442246] Kernel Offset: disabled
[    3.442247] Memory Limit: none
[    3.604271] trusty-log panic notifier - trusty version Built: 08:40:58 Feb 19 2021 
[    3.604271] Rebooting in 5 seconds..

Thanks!

One further piece of information that could help: if we order our service after nvpmodel.service by adding After=nvpmodel.service to the [Unit] section, then there is no kernel panic.
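For anyone wanting to try the same ordering without editing the unit file itself, here is a minimal sketch using a systemd drop-in. It assumes the unit is named devicequery.service as in the first post; the helper name write_ordering_dropin and the drop-in file name 10-after-nvpmodel.conf are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: write a drop-in that orders a unit after nvpmodel.service.
#   $1 = systemd unit directory (e.g. /etc/systemd/system)
#   $2 = unit name (e.g. devicequery.service)
write_ordering_dropin() {
    dropin_dir="$1/$2.d"
    mkdir -p "$dropin_dir" || return 1
    # After= only affects start ordering; it does not pull nvpmodel.service in.
    printf '[Unit]\nAfter=nvpmodel.service\n' > "$dropin_dir/10-after-nvpmodel.conf"
}

# Example (needs root; reload systemd afterwards):
#   write_ordering_dropin /etc/systemd/system devicequery.service
#   systemctl daemon-reload
```

The ordering only has an effect if nvpmodel.service is itself started during boot, since After= does not add a dependency.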

Thank you, Gavin, for the log and info. We are investigating this.

Hi Gavin,

These logs are suppressed. You need to boot with ignore_loglevel on the kernel command line, which is set in this file in the BSP: rootfs/boot/extlinux/extlinux.conf

Alternatively, can you tell us what changes were made to create this crash? We are unable to reproduce it right now.
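As a rough sketch of the ignore_loglevel suggestion: the helper below appends the flag to the kernel arguments in an extlinux.conf. It assumes the arguments are on a line starting with the APPEND keyword; the function name add_ignore_loglevel is hypothetical. On an already-flashed board the file is /boot/extlinux/extlinux.conf, as the U-Boot output above shows.

```shell
#!/bin/sh
# Hypothetical helper: add ignore_loglevel to every APPEND line of an
# extlinux.conf, unless the line already contains it.
#   $1 = path to extlinux.conf
add_ignore_loglevel() {
    conf="$1"
    # Keep a backup so the original boot configuration can be restored.
    cp "$conf" "$conf.bak" || return 1
    sed '/^[[:space:]]*APPEND[[:space:]]/{/ignore_loglevel/!s/$/ ignore_loglevel/;}' \
        "$conf.bak" > "$conf"
}

# Example (needs root on the target, or run against the BSP rootfs copy):
#   add_ignore_loglevel /boot/extlinux/extlinux.conf
```

Running it twice does not duplicate the flag, and the previous version of the file is left next to it with a .bak suffix.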

Thank you for looking into this. I made no changes to create the crash, only the steps at the top of this post. I guess the module revision may make a difference due to DRAM timing. I have only seen this problem on D00 modules so far, but I haven’t tested other revisions.

I have taken the rig down which reproduced this problem, and I don’t think I will be able to put it back together.

We read this in system/nvzramconfig.service:
“TPC power gating must be enabled before anything touching gpu”
After=nvpmodel.service

I have accepted the above comment as the explanation for what I saw.
