Kernel Crash on NVIDIA Jetson Orin NX Due to Real-Time Priority and 'cudaStreamCreate' Process

Hey
We are encountering a critical issue on the NVIDIA Jetson Orin NX platform related to real-time priority settings, which is leading to kernel crashes, particularly when specific processes are executed on CPU0.

Detailed Description: On the NVIDIA Jetson Orin NX, setting real-time priority as per NVIDIA’s guidelines (https://forums.developer.nvidia.com/t/shceduling-real-time-priority-linux-thread-on-agx-orin-failed/227497/8) seems to interfere with the OS kernel’s CPU scheduling. The issue is particularly evident when processes that require high CPU usage, such as those running InitCudaEngine, which calls cudaStreamCreate(&mStream);, are allocated to CPU0.

These processes, when given real-time priority, tend to hog CPU0 for prolonged periods. This excessive use causes hardware instability and triggers kernel crashes. The problem is exacerbated by the fact that most IRQs (interrupt request lines) are handled on CPU0, making the system more vulnerable while these processes are running.

Steps to Reproduce

  1. Set the real-time priority parameters on the NVIDIA Jetson Orin NX as per NVIDIA’s recommendation (echo -1 > /proc/sys/kernel/sched_rt_runtime_us).
  2. Run a process that executes InitCudaEngine / cudaStreamCreate(&mStream); pinned to CPU0 (see the sketch after this list).
  3. Observe the system behavior for instability or kernel crashes.
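
For reference, a rough sketch of how we trigger it. The binary name ./my_cuda_app is a placeholder for an application that calls cudaStreamCreate(&mStream); during init, not our actual code:

$ sudo -s
$ echo -1 > /proc/sys/kernel/sched_rt_runtime_us
$ # pin the process to CPU0 and give it SCHED_FIFO priority 99
$ taskset -c 0 chrt -f 99 ./my_cuda_app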

Expected Behavior: Ideally, the OS kernel should manage CPU scheduling effectively, preventing any process, even one with real-time priority, from monopolizing resources and causing system instability or kernel crashes. Alternatively, software that runs on CPU0 must be written so that it does not block the core for too long.

Actual Behavior: On the NVIDIA Jetson Orin NX, processes with real-time priority, especially those executing InitCudaEngine / cudaStreamCreate(&mStream); on CPU0, occupy CPU resources for extended periods, resulting in hardware instability and frequent kernel crashes.

System Information:

  • Device: NVIDIA Jetson Orin Devkit
  • Module: running as Orin NX
  • Operating System: JetPack 5.1.2

Query: How can I configure the real-time priority settings on the NVIDIA Jetson Orin NX to avoid these kernel crashes, especially when running processes that call cudaStreamCreate(&mStream); on CPU0? Are there recommended practices or settings adjustments that can help mitigate this issue?

For now we are using the taskset command to prevent the processes that run TensorRT and CUDA from running on CPU0, but it does not feel like the best solution.
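
For reference, a sketch of that workaround (./my_trt_app and <pid> are placeholders):

$ # launch the TensorRT/CUDA process on cores 1-7 only, keeping it off CPU0
$ taskset -c 1-7 ./my_trt_app
$ # or re-pin a process that is already running
$ taskset -pc 1-7 <pid>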

Thanks

EDIT

“We have identified a more specific cause of the issue: it appears that the call cudaStreamCreate(&mStream); is what leads to CPU0 hanging, particularly when the process running it has real-time priority and is executed on CPU0. This clarification is important, as I previously suspected the InitCudaEngine function, but that turns out not to be the case.”

Still checking; will update next week.


Hi,

The original issue is related to CONFIG_RT_GROUP_SCHED, which requires the RT bandwidth and runtime to be allocated properly.

There are two approaches to solving the issue.

/proc/sys/kernel/sched_rt_runtime_us  = -1

Setting the above to -1 grants 100% of the CPU bandwidth so that the RT task gets enough CPU time and RT bandwidth when CONFIG_RT_GROUP_SCHED is enabled. But it seems to lead to the issue you observe.
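
If it helps, here is a quick way to check both settings on a running system (assuming /proc/config.gz is available on your image; the defaults in the comment are the usual upstream values):

$ # check whether the running kernel was built with CONFIG_RT_GROUP_SCHED
$ zcat /proc/config.gz | grep RT_GROUP_SCHED
$ # current RT bandwidth settings (upstream defaults: period 1000000 us, runtime 950000 us)
$ cat /proc/sys/kernel/sched_rt_period_us
$ cat /proc/sys/kernel/sched_rt_runtime_us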

Could you try the other solution, disabling CONFIG_RT_GROUP_SCHED directly?

Thanks.

Hi,

Have you had a chance to try disabling CONFIG_RT_GROUP_SCHED?
Thanks.

hey AstaLLL,
thanks for answering.
two questions:

  1. It seems that “-1” still dominates the runtime, i.e. there is no limit at all, doesn’t it? (source: kernel …)
  2. Can you elaborate more on how to do it? Can I disable CONFIG_RT_GROUP_SCHED without rebuilding the kernel?

Hi,

You don’t need to set the sched_rt_runtime_us if the second solution is applied.
It is only required if CONFIG_RT_GROUP_SCHED is enabled.

To disable the configuration, please build a custom kernel.
You can find the detailed steps below:

https://docs.nvidia.com/jetson/archives/r35.4.1/DeveloperGuide/text/SD/Kernel/KernelCustomization.html#manually-downloading-and-expanding-kernel-sources
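
As a rough sketch of the relevant step (the defconfig name and the menuconfig path below are from memory and should be checked against that guide):

$ # after generating the .config as described in the guide, e.g.
$ # make ARCH=arm64 O=$TEGRA_KERNEL_OUT tegra_defconfig
$ scripts/config --file $TEGRA_KERNEL_OUT/.config --disable RT_GROUP_SCHED
$ # or via menuconfig: General setup -> Control Group support -> CPU controller
$ #   -> “Group scheduling for SCHED_RR/FIFO” (disable it), then rebuild and flash as usual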

Thanks.

As I understand it, our container runtime and k3s use cgroups to control the bandwidth of CPU/memory and other related resources, so disabling that cgroup feature won’t work for us.

Hi,

Thanks for the feedback.

We need to discuss this with our internal team.
Will share more info with you.

Thanks.

Hi,

We need some extra kernel logs to move forward.

However, we are not able to reproduce this issue in our environment.
Based on your description, we ran the simpleStreams sample on CPU 0 after setting the RT priority workaround.
It works correctly without any kernel crash or error log.

$ sudo -s
$ echo -1 > /proc/sys/kernel/sched_rt_runtime_us
$ taskset -c 0 ./simpleStreams

Is anything missing from our testing?

Thanks.

Hey, thanks for the test. It seems that sometimes, at random, we have more processes running on CPU0 and opening a stream (we use streams other than the default one).

Our workaround is to use taskset so that our processes do not run on CPU0 at all, but it feels very hacky.

Hi,

Could you help to provide a way to reproduce this issue?
For now, we don’t see any issue after setting the sched_rt_runtime_us.

Our internal team needs some internal logs to check the CPU stall issue further.

Thanks.


Hey AstaLLL

It’s important to note that setting sched_rt_runtime_us to -1 can be hazardous. This configuration can lead to a situation where multiple processes operating with real-time priority never relinquish control, which prevents the operating system from performing essential tasks. This is particularly critical for Jetson devices: if these processes monopolize CPU0, they block hardware interrupts, including those for SPI, Ethernet, CAN, etc.

Here’s a script example to illustrate the potential issue:

#!/bin/bash

# Install stress-ng
apt-get install -y stress-ng

# Run stress-ng with one CPU worker at 100% load, pinned to CPU0, for 90 seconds
taskset -c 0 stress-ng -c 1 -l 100 -t 90 &

# Run cangen on can0, sending a message every 20 ms
cangen -g 20 can0 &

# Short delay so the stress-ng worker has started
sleep 0.2

# Get the PID of the stress-ng CPU worker
my_pid=$(pidof "stress-ng-cpu")

# Give that worker real-time (SCHED_FIFO) priority 99
chrt -f -p 99 $my_pid
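
On our side, a simple way to see the effect while the script runs is to watch the CAN traffic and the kernel log in parallel (candump comes from the same can-utils package as cangen):

# CAN frames stop arriving once the RT stress-ng worker monopolizes CPU0
candump can0

# in another terminal, watch for RCU/soft-lockup stall messages
dmesg -w | grep -i -e stall -e lockup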

Hi,

Thanks for sharing the script.

We can reproduce the kernel crash internally and collect the essential logs.
Our internal team is now working on it. Will share more info with you later.

Thanks.

Hi,

To gather the RT logs, we also need the RT task that requires the RT priority.
Could you also share it with us?

Thanks.

Hey, I cannot share the code here, but in general, take a task that outputs a trigger on a GPIO with a constant delay between pairs of ticks.
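
A rough bash stand-in with the same shape might look like this (the GPIO number and the sysfs interface are placeholders, not our real setup):

#!/bin/bash
# Hypothetical stand-in for our RT task: toggle a GPIO at a fixed period.
# GPIO 388 is a placeholder; pick a pin that exists on your carrier board.
GPIO=388
echo $GPIO > /sys/class/gpio/export
echo out > /sys/class/gpio/gpio$GPIO/direction

# give this shell SCHED_FIFO priority, like our real task
chrt -f -p 90 $$

while true; do
    echo 1 > /sys/class/gpio/gpio$GPIO/value
    echo 0 > /sys/class/gpio/gpio$GPIO/value
    sleep 0.001    # constant delay between pairs of ticks
done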

Hi,

Could you provide a simple app, or share the app with us through private messages?

Thanks.

Hi,

We have some discussions with our internal team:

  1. Why is it necessary to set CPU0 affinity for processes that require high CPU usage?

  2. Is there any code logic that could potentially cause tasks to become stuck for an extended period of time?

If there are no issues, is it possible to allow the rt-task to run on any core? This is because setting /proc/sys/kernel/sched_rt_runtime_us = -1 grants unlimited bandwidth to rt-tasks, which means that if an rt-task becomes stuck, it can lead to a CPU stall.

In short, when setting sched_rt_runtime_us=-1, the task running on CPU0 must not get stuck.
If the task needs time to finish, please run it on other cores to prevent a CPU0 stall, for example as shown below.
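
A sketch of that (./rt_task stands in for the actual RT application):

$ # keep the RT task off CPU0 while still giving it SCHED_FIFO priority
$ taskset -c 1-7 chrt -f 90 ./rt_task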

Thanks.

Hey,
From our perspective it is not clear why we get the error at such a high rate. We do not block the CPU for very long, but it usually happens in an app that runs CUDA code that creates a stream.

Our workaround is roughly what you suggested: we moved all our main apps to run on CPUs 1-7, freeing CPU0 to do the HW work.

Hi,

If the CUDA stream causes the kernel crash at a high rate, please launch the application on other cores.
This issue looks more like a limitation than a bug.

Thanks.
