Hi!
We are running a Triton Inference Server with a model for image object detection on an Orin NX with JetPack 6.1 on a ConnectTech Hadron NGX012 carrier board.
When the inference server is under heavy load, meaning that we send enough images per second to keep the GPU at ~100% utilization, we lose the connection to the Jetson. We know that the network and the GPU are down, but the system itself keeps running, because after a reboot we can access logs with timestamps more recent than the moment the connection was lost.
This seems to happen regardless of the power mode (we tried MAXN and 25W) and regardless of whether jetson_clocks is on or off.
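For reference, this is roughly how we generate the load (a simplified sketch; the model name, input tensor name, and image shape below are placeholders, not our actual configuration):

```python
# Simplified load-generation sketch (placeholder model/tensor names and shape).
# We push preprocessed images to Triton's HTTP endpoint in a tight loop until
# the GPU sits at ~100% utilization.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Dummy batch standing in for real camera frames.
batch = np.random.rand(1, 3, 640, 640).astype(np.float32)

inp = httpclient.InferInput("images", list(batch.shape), "FP32")
inp.set_data_from_numpy(batch)

while True:
    # Responses are discarded; the only goal is to keep the GPU fully loaded.
    client.infer(model_name="detector", inputs=[inp])
```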
At the moment we lose the connection, the syslog file shows this:
Mar 5 14:30:08 orin-nx-2 kernel: [16614.780047] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Mar 5 14:30:08 orin-nx-2 kernel: [16614.780057] rcu: 0-...0: (0 ticks this GP) idle=db1/1/0x4000000000000002 softirq=2813838/2813838 fqs=9346
Mar 5 14:30:08 orin-nx-2 kernel: [16614.780066] (detected by 4, t=21007 jiffies, g=4972141, q=16782)
Mar 5 14:30:08 orin-nx-2 kernel: [16614.780069] Task dump for CPU 0:
Mar 5 14:30:08 orin-nx-2 kernel: [16614.780071] task:cuda-EvtHandlr state:R running task stack: 0 pid:577067 ppid:575844 flags:0x00000806
Mar 5 14:30:08 orin-nx-2 kernel: [16614.780078] Call trace:
Mar 5 14:30:08 orin-nx-2 kernel: [16614.780080] __switch_to+0x104/0x160
Mar 5 14:30:08 orin-nx-2 kernel: [16614.780092] 0xffff98000c20
Mar 5 14:30:20 orin-nx-2 kernel: [16627.079686] nvme nvme0: I/O 10 QID 8 timeout, completion polled
Mar 5 14:30:21 orin-nx-2 kernel: [16627.395642] nvme nvme0: I/O 0 QID 4 timeout, completion polled
Mar 5 14:30:51 orin-nx-2 kernel: [16657.282582] nvme nvme0: I/O 125 QID 5 timeout, completion polled
Mar 5 14:30:51 orin-nx-2 kernel: [16657.282986] nvme nvme0: I/O 11 QID 8 timeout, completion polled
Mar 5 14:30:51 orin-nx-2 kernel: [16657.602601] nvme nvme0: I/O 61 QID 6 timeout, completion polled
These messages repeat in a loop until reboot.
We do not know what these messages mean or what we could try in order to debug the issue. Any help would be much appreciated.
Thanks,
Alex