Hey
We are encountering a critical issue on the NVIDIA Jetson Orin NX platform related to real-time priority settings, which is leading to kernel crashes, particularly when specific processes are executed on CPU0.
Detailed Description: On the NVIDIA Jetson Orin NX, setting real-time priority as per NVIDIA’s guidelines (https://forums.developer.nvidia.com/t/shceduling-real-time-priority-linux-thread-on-agx-orin-failed/227497/8) seems to interfere with the OS kernel’s CPU scheduling capabilities. This issue is particularly evident when processes that require high CPU usage, such as those calling cudaStreamCreate(&mStream); (initially attributed to InitCudaEngine, see the EDIT below), are allocated to CPU0.
These processes, when given real-time priority, tend to hog CPU0 resources for prolonged periods. This excessive use is causing hardware instability and triggering kernel crashes. The problem is exacerbated by the fact that most IRQs (Interrupt Request Lines) are dependent on CPU0, making the system more vulnerable when these processes are running.
Steps to Reproduce
- Set real-time priority parameters on NVIDIA Jetson Orin NX as per NVIDIA’s recommendations (echo -1 > /proc/sys/kernel/sched_rt_runtime_us).
- Run a process with real-time priority that executes cudaStreamCreate(&mStream); (initially suspected: InitCudaEngine and build) on CPU0.
- Observe the system behavior for instability or kernel crashes.
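The steps above can be sketched as a small helper for anyone trying to reproduce this. It is only an illustration, not our actual application code: the names single_cpu_mask and run_pinned_realtime are hypothetical, the SCHED_FIFO policy is an assumption about how the real-time priority is applied, and the CUDA call is passed in as a callback so the sketch itself needs only the Linux scheduling API.

```cpp
#include <sched.h>      // sched_setaffinity, sched_setscheduler, cpu_set_t (Linux/glibc)
#include <cassert>
#include <cerrno>
#include <functional>

// Hypothetical helper: builds an affinity mask that contains only `cpu`.
cpu_set_t single_cpu_mask(int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    return mask;
}

// Hypothetical helper: pins the calling thread to `cpu`, switches it to
// SCHED_FIFO at `priority` (assumed policy), then runs `work`.
// Returns 0 on success, or the errno value of the first call that failed
// (for example EPERM when run without sufficient privileges).
int run_pinned_realtime(int cpu, int priority, const std::function<void()>& work) {
    cpu_set_t mask = single_cpu_mask(cpu);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        return errno;
    }

    sched_param param{};
    param.sched_priority = priority;
    if (sched_setscheduler(0, SCHED_FIFO, &param) != 0) {
        return errno;
    }

    work();  // in the reported case this would be cudaStreamCreate(&mStream);
    return 0;
}
```

With /proc/sys/kernel/sched_rt_runtime_us set to -1, a call such as run_pinned_realtime(0, 50, [&] { cudaStreamCreate(&mStream); }); would correspond to the scenario described here. The priority value 50 is an arbitrary placeholder, not the value we use.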
Expected Behavior: Ideally, the OS kernel should manage CPU scheduling effectively, preventing any process, even one with real-time priority, from monopolizing resources and causing system instability or kernel crashes. Alternatively, software that runs on CPU0 should know not to block the core for too long.
Actual Behavior: On the NVIDIA Jetson Orin NX, processes with real-time priority, especially those executing cudaStreamCreate(&mStream); on CPU0, occupy CPU resources for extended periods, resulting in hardware instability and frequent kernel crashes.
System Information:
- Device: NVIDIA Jetson Orin Devkit
- GPU Model: running as Orin NX
- Operating System: JetPack 5.1.2
Query: How can I configure the real-time priority settings on the NVIDIA Jetson Orin NX to avoid these kernel crashes, especially when running processes that call cudaStreamCreate(&mStream); on CPU0? Are there recommended practices or settings adjustments that can help mitigate this issue?
For now, we are using the taskset command to keep the processes that run TensorRT and CUDA off CPU0, but this does not feel like the best solution.
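For reference, the same restriction that taskset applies from the outside could also be applied from inside the process before any CUDA call is made. The following is a minimal sketch under that assumption: without_cpu and avoid_cpu0 are hypothetical names, and only the standard Linux sched_getaffinity / sched_setaffinity calls are used.

```cpp
#include <sched.h>      // sched_getaffinity, sched_setaffinity, cpu_set_t (Linux/glibc)
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <cstring>

// Hypothetical helper: returns a copy of `allowed` with `cpu` removed.
// If removing it would leave no CPU at all (e.g. a single-core mask),
// the original mask is returned unchanged.
cpu_set_t without_cpu(const cpu_set_t& allowed, int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t result = allowed;
    CPU_CLR(cpu, &result);
    if (CPU_COUNT(&result) == 0) {
        return allowed;
    }
    return result;
}

// Hypothetical helper: restricts the calling thread to every CPU it is
// currently allowed to use except CPU0. Returns true on success.
bool avoid_cpu0() {
    cpu_set_t current;
    CPU_ZERO(&current);
    if (sched_getaffinity(0, sizeof(current), &current) != 0) {
        std::fprintf(stderr, "sched_getaffinity: %s\n", std::strerror(errno));
        return false;
    }

    cpu_set_t target = without_cpu(current, 0);
    if (sched_setaffinity(0, sizeof(target), &target) != 0) {
        std::fprintf(stderr, "sched_setaffinity: %s\n", std::strerror(errno));
        return false;
    }
    return true;
}
```

Calling avoid_cpu0() at the start of main(), before the thread is given real-time priority and before cudaStreamCreate(&mStream); runs, would keep that thread off CPU0; threads created afterwards inherit the mask. Like taskset, this only works around the problem and does not explain why CPU0 hangs.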
Thanks
EDIT
“We have identified a more specific cause of the issue: it appears that the function cudaStreamCreate(&mStream); is what’s leading to CPU0 hanging, particularly when the process running it has real-time priority and is executed on CPU0. This clarification is important because I previously suspected the InitCudaEngine function, but that turned out not to be the case.”