Kernel Crash on NVIDIA Jetson Orin NX Due to Real-Time Priority and 'cudaStreamCreate' Process

Hey
We are encountering a critical issue on the NVIDIA Jetson Orin NX platform related to real-time priority settings, which is leading to kernel crashes, particularly when specific processes are executed on CPU0.

Detailed Description: On the NVIDIA Jetson Orin NX, setting real-time priority as per NVIDIA’s guidelines (https://forums.developer.nvidia.com/t/shceduling-real-time-priority-linux-thread-on-agx-orin-failed/227497/8) seems to interfere with the OS kernel’s CPU scheduling. The issue is particularly evident when processes that require high CPU usage, such as those running InitCudaEngine, which calls cudaStreamCreate(&mStream);, are allocated to CPU0.

These processes, when given real-time priority, tend to hog CPU0 for prolonged periods. This excessive use causes hardware instability and triggers kernel crashes. The problem is exacerbated by the fact that most IRQs (interrupt request lines) are handled on CPU0, making the system more vulnerable while these processes are running.

Steps to Reproduce

  1. Set the real-time priority parameters on the NVIDIA Jetson Orin NX as per NVIDIA’s recommendation (echo -1 > /proc/sys/kernel/sched_rt_runtime_us).
  2. Run a process that executes InitCudaEngine / cudaStreamCreate(&mStream); pinned to CPU0 (see the sketch after this list).
  3. Observe the system behavior for instability or kernel crashes.
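
For reference, a rough sketch of how we trigger it. The binary name ./my_cuda_app is a placeholder for an application that calls cudaStreamCreate(&mStream); during init, not our actual code:

$ sudo -s
$ echo -1 > /proc/sys/kernel/sched_rt_runtime_us
$ # pin the process to CPU0 and give it SCHED_FIFO priority 99
$ taskset -c 0 chrt -f 99 ./my_cuda_app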

Expected Behavior: Ideally, the OS kernel should manage CPU scheduling effectively, preventing any process, even one with real-time priority, from monopolizing resources and causing system instability or kernel crashes. Alternatively, software that runs on CPU0 must be written so that it does not block the core for too long.

Actual Behavior: On the NVIDIA Jetson Orin NX, processes with real-time priority, especially those executing InitCudaEngine / cudaStreamCreate(&mStream); on CPU0, occupy CPU resources for extended periods, resulting in hardware instability and frequent kernel crashes.

System Information:

  • Device: NVIDIA Jetson Orin Devkit
  • Module: running as Orin NX
  • Operating System: JetPack 5.1.2

Query: How can I configure the real-time priority settings on the NVIDIA Jetson Orin NX to avoid these kernel crashes, especially when running processes that call cudaStreamCreate(&mStream); on CPU0? Are there recommended practices or settings adjustments that can help mitigate this issue?

For now we are using the taskset command to prevent the processes that run TensorRT and CUDA from running on CPU0, but it does not feel like the best solution.
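
For reference, a sketch of that workaround (./my_trt_app and <pid> are placeholders):

$ # launch the TensorRT/CUDA process on cores 1-7 only, keeping it off CPU0
$ taskset -c 1-7 ./my_trt_app
$ # or re-pin a process that is already running
$ taskset -pc 1-7 <pid>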

Thanks

EDIT

“We have identified a more specific cause of the issue: it appears that the call cudaStreamCreate(&mStream); is what leads to CPU0 hanging, particularly when the process running it has real-time priority and is executed on CPU0. This clarification is important, as I previously suspected the InitCudaEngine function, but that turns out not to be the case.”

Still checking; will update next week.


Hi,

The original issue is related to CONFIG_RT_GROUP_SCHED, which requires the RT bandwidth and runtime to be allocated properly.

There are two approaches to solving the issue.

/proc/sys/kernel/sched_rt_runtime_us  = -1

Setting the above to -1 grants 100% of the CPU bandwidth so that the RT task gets enough CPU time and RT bandwidth when CONFIG_RT_GROUP_SCHED is enabled. But it seems to lead to the issue you observe.
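
If it helps, here is a quick way to check both settings on a running system (assuming /proc/config.gz is available on your image; the defaults in the comment are the usual upstream values):

$ # check whether the running kernel was built with CONFIG_RT_GROUP_SCHED
$ zcat /proc/config.gz | grep RT_GROUP_SCHED
$ # current RT bandwidth settings (upstream defaults: period 1000000 us, runtime 950000 us)
$ cat /proc/sys/kernel/sched_rt_period_us
$ cat /proc/sys/kernel/sched_rt_runtime_us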

Could you try the other solution, disabling CONFIG_RT_GROUP_SCHED directly?

Thanks.

Hi,

Have you had a chance to try disabling CONFIG_RT_GROUP_SCHED?
Thanks.

hey AstaLLL,
thanks for answering.
two questions:

  1. It seems that “-1” still dominates the runtime, i.e. there is no limit at all, doesn’t it? (source: kernel …)
  2. Can you elaborate more on how to do it? Can I disable CONFIG_RT_GROUP_SCHED without rebuilding the kernel?

Hi,

You don’t need to set the sched_rt_runtime_us if the second solution is applied.
It is only required if CONFIG_RT_GROUP_SCHED is enabled.

To disable the configuration, please build a custom kernel.
You can find the detailed steps below:

https://docs.nvidia.com/jetson/archives/r35.4.1/DeveloperGuide/text/SD/Kernel/KernelCustomization.html#manually-downloading-and-expanding-kernel-sources
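
As a rough sketch of the relevant step (the defconfig name and the menuconfig path below are from memory and should be checked against that guide):

$ # after generating the .config as described in the guide, e.g.
$ # make ARCH=arm64 O=$TEGRA_KERNEL_OUT tegra_defconfig
$ scripts/config --file $TEGRA_KERNEL_OUT/.config --disable RT_GROUP_SCHED
$ # or via menuconfig: General setup -> Control Group support -> CPU controller
$ #   -> “Group scheduling for SCHED_RR/FIFO” (disable it), then rebuild and flash as usual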

Thanks.

As I understand it, our container runtime and k3s use cgroups to control the bandwidth of CPU/memory and other related resources, so disabling that cgroup feature won’t work for us.

Hi,

Thanks for the feedback.

We need to discuss this with our internal team.
Will share more info with you.

Thanks.

Hi,

We need some extra kernel logs to move forward.

However, we are not able to reproduce this issue in our environment.
Based on your description, we ran the simpleStreams sample on CPU 0 after setting the RT priority workaround.
It works correctly without any kernel crash or error log.

$ sudo -s
$ echo -1 > /proc/sys/kernel/sched_rt_runtime_us
$ taskset -c 0 ./simpleStreams

Is anything missing from our testing?

Thanks.

Hey, thanks for the test. It seems that sometimes, at random, we have more processes running on CPU0 and opening a stream (we use streams other than the default one).

Our workaround is to use taskset so that our processes do not run on CPU0 at all, but it feels very hacky.

Hi,

Could you help to provide a way to reproduce this issue?
For now, we don’t see any issue after setting the sched_rt_runtime_us.

Our internal team needs some internal logs to check the CPU stall issue further.

Thanks.


Hey AstaLLL

It’s important to note that setting sched_rt_runtime_us to -1 can be hazardous. This configuration can lead to a situation where multiple processes operating with real-time priority never relinquish control, which prevents the operating system from performing essential tasks. This is particularly critical for Jetson devices: if these processes monopolize CPU0, they block hardware interrupts, including those for SPI, Ethernet, CAN, etc.

Here’s a script example to illustrate the potential issue:

#!/bin/bash

# Install stress-ng
apt-get install -y stress-ng

# Run stress-ng with one CPU worker at 100% load, pinned to CPU0, for 90 seconds
taskset -c 0 stress-ng -c 1 -l 100 -t 90 &

# Run cangen on can0, sending a message every 20 ms
cangen -g 20 can0 &

# Short delay so the stress-ng worker has started
sleep 0.2

# Get the PID of the stress-ng CPU worker
my_pid=$(pidof "stress-ng-cpu")

# Give that worker real-time (SCHED_FIFO) priority 99
chrt -f -p 99 $my_pid
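
On our side, a simple way to see the effect while the script runs is to watch the CAN traffic and the kernel log in parallel (candump comes from the same can-utils package as cangen):

# CAN frames stop arriving once the RT stress-ng worker monopolizes CPU0
candump can0

# in another terminal, watch for RCU/soft-lockup stall messages
dmesg -w | grep -i -e stall -e lockup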

Hi,

Thanks for sharing the script.

We can reproduce the kernel crash internally and collect the essential logs.
Our internal team is now working on it. Will share more info with you later.

Thanks.

Hi,

To gather the RT logs, we also need the RT task that requires the RT priority.
Could you also share it with us?

Thanks.

Hey, I cannot share the code here, but in general, take a task that outputs a trigger on a GPIO with a constant delay between pairs of ticks.
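
A rough bash stand-in with the same shape might look like this (the GPIO number and the sysfs interface are placeholders, not our real setup):

#!/bin/bash
# Hypothetical stand-in for our RT task: toggle a GPIO at a fixed period.
# GPIO 388 is a placeholder; pick a pin that exists on your carrier board.
GPIO=388
echo $GPIO > /sys/class/gpio/export
echo out > /sys/class/gpio/gpio$GPIO/direction

# give this shell SCHED_FIFO priority, like our real task
chrt -f -p 90 $$

while true; do
    echo 1 > /sys/class/gpio/gpio$GPIO/value
    echo 0 > /sys/class/gpio/gpio$GPIO/value
    sleep 0.001    # constant delay between pairs of ticks
done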

Hi,

Could you provide a simple app, or share the app with us through private messages?

Thanks.

Hi,

We have some discussions with our internal team:

  1. Why is it necessary to set CPU0 affinity for processes that require high CPU usage?

  2. Is there any code logic that could potentially cause tasks to become stuck for an extended period of time?

If there are no issues, is it possible to allow the rt-task to run on any core? This is because setting /proc/sys/kernel/sched_rt_runtime_us = -1 grants unlimited bandwidth to rt-tasks, which means that if an rt-task becomes stuck, it can lead to a CPU stall.

In short, when setting sched_rt_runtime_us=-1, the task running on CPU0 must not get stuck.
If the task needs time to finish, please run it on other cores to prevent a CPU0 stall, for example as shown below.
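
A sketch of that (./rt_task stands in for the actual RT application):

$ # keep the RT task off CPU0 while still giving it SCHED_FIFO priority
$ taskset -c 1-7 chrt -f 90 ./rt_task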

Thanks.

Hey,
From our perspective it is not clear why we get the error at such a high rate. We do not block the CPU for very long, but it usually happens in an app that runs CUDA code that creates a stream.

Our workaround is roughly what you suggested: we moved all our main apps to run on CPUs 1-7, freeing CPU0 to do the HW work.

Hi,

If the CUDA stream causes the kernel crash at a high rate, please launch the application on other cores.
This issue looks more like a limitation than a bug.

Thanks.
