rcu_preempt stall caused by cuda-EvtHandlr?

Hi!

We are running Triton Inference Server with an image object detection model on an Orin NX (JetPack 6.1) on a ConnectTech Hadron NGX012 carrier board.

When the inference server's load is high, i.e. we send enough images per second to keep the GPU at ~100% utilization, we lose the connection to the Jetson. The network and the GPU go down, but the system itself keeps running: after a reboot we find logs with timestamps more recent than the moment the connection was lost.

This seems to happen regardless of the power mode (we tried MAXN and 25W) and regardless of whether jetson_clocks is on or off.

At the moment we lose the connection, syslog shows this:

Mar  5 14:30:08 orin-nx-2 kernel: [16614.780047] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Mar  5 14:30:08 orin-nx-2 kernel: [16614.780057] rcu:   0-...0: (0 ticks this GP) idle=db1/1/0x4000000000000002 softirq=2813838/2813838 fqs=9346 
Mar  5 14:30:08 orin-nx-2 kernel: [16614.780066]        (detected by 4, t=21007 jiffies, g=4972141, q=16782)
Mar  5 14:30:08 orin-nx-2 kernel: [16614.780069] Task dump for CPU 0:
Mar  5 14:30:08 orin-nx-2 kernel: [16614.780071] task:cuda-EvtHandlr  state:R  running task     stack:    0 pid:577067 ppid:575844 flags:0x00000806
Mar  5 14:30:08 orin-nx-2 kernel: [16614.780078] Call trace:
Mar  5 14:30:08 orin-nx-2 kernel: [16614.780080]  __switch_to+0x104/0x160
Mar  5 14:30:08 orin-nx-2 kernel: [16614.780092]  0xffff98000c20
Mar  5 14:30:20 orin-nx-2 kernel: [16627.079686] nvme nvme0: I/O 10 QID 8 timeout, completion polled
Mar  5 14:30:21 orin-nx-2 kernel: [16627.395642] nvme nvme0: I/O 0 QID 4 timeout, completion polled
Mar  5 14:30:51 orin-nx-2 kernel: [16657.282582] nvme nvme0: I/O 125 QID 5 timeout, completion polled
Mar  5 14:30:51 orin-nx-2 kernel: [16657.282986] nvme nvme0: I/O 11 QID 8 timeout, completion polled
Mar  5 14:30:51 orin-nx-2 kernel: [16657.602601] nvme nvme0: I/O 61 QID 6 timeout, completion polled

These messages repeat in a loop until we reboot the device.

We do not know what these messages mean or what we could try in order to debug the issue. Any help would be much appreciated.

Thanks,

Alex

Hi,

The error seems to be related to a CPU stall: the RCU subsystem is reporting that CPU 0 stopped responding for an extended period (here while running the cuda-EvtHandlr task), and the NVMe timeouts that follow are likely a symptom of the same stall.

Do you have other CPU jobs running at the same time?
Is the CPU also fully occupied?
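
If it helps, below is a minimal sketch for logging per-core load during your test (psutil is an assumed dependency here, installable via pip; tegrastats or mpstat would work just as well):

# cpu_log.py -- log per-core CPU utilization once per second.
# Sketch only; psutil is an assumed dependency (pip install psutil).
import time
import psutil

if __name__ == "__main__":
    while True:
        # percpu=True returns one utilization percentage per core,
        # averaged over the 1-second sampling interval.
        per_core = psutil.cpu_percent(interval=1.0, percpu=True)
        line = " ".join(f"cpu{i}:{p:5.1f}%" for i, p in enumerate(per_core))
        print(f"{time.strftime('%H:%M:%S')} {line}", flush=True)

Redirecting the output to a file (flushed per line, as above) should preserve the per-core load right up to the stall, even across the reboot.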

Thanks.

There are some CPU processes running, but overall the device does not look overloaded.
This is a screenshot of jtop taken just before the connection was lost.

Hi,

We will need to reproduce this issue internally to gather more info.
Could you share the detailed steps to reproduce the issue?

Do you also have an Orin NX devkit?
If so, could you run the same test on a devkit to see whether the issue also occurs there?

Thanks.

Hi!

Unfortunately, we do not have an Orin NX devkit.
Our current setup cannot be shared, so we will try to create a minimal shareable setup that reproduces the issue. We will get back to you as soon as possible.
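
In the meantime, our load test is conceptually no more than the sketch below, run in several parallel processes; the model name, input tensor name, shape, and datatype are placeholders rather than our real values, and it uses the standard tritonclient HTTP API:

# load_test.py -- flood the Triton server with synchronous inference requests.
# Sketch only: MODEL, INPUT_NAME, and SHAPE are placeholders for the real
# detection model; adjust them to match the model's config.pbtxt.
import numpy as np
import tritonclient.http as httpclient

MODEL = "detector"         # placeholder model name
INPUT_NAME = "input"       # placeholder input tensor name
SHAPE = [1, 3, 640, 640]   # placeholder NCHW input shape

def main():
    client = httpclient.InferenceServerClient(url="localhost:8000")
    image = np.random.rand(*SHAPE).astype(np.float32)  # dummy image data
    inp = httpclient.InferInput(INPUT_NAME, SHAPE, "FP32")
    inp.set_data_from_numpy(image)
    sent = 0
    while True:
        # Synchronous requests in a tight loop; a few of these processes
        # in parallel are enough to keep the GPU at ~100%.
        client.infer(MODEL, inputs=[inp])
        sent += 1
        if sent % 100 == 0:
            print(f"sent {sent} requests", flush=True)

if __name__ == "__main__":
    main()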

Thanks,

Alex