rcu_preempt stall caused by cuda-EvtHandlr?

Hi!

We are running a Triton Inference Server with a model for image object detection on an Orin NX with JetPack 6.1 on a ConnectTech Hadron NGX012 carrier board.

When the inference server is under high load, meaning that we send enough images per second to keep the GPU working at ~100%, we lose the connection to the Jetson. We know that the network and the GPU go down while the system keeps running, because after a reboot we can access logs with timestamps later than the moment the connection was lost.

This seems to happen regardless of the power mode (we tried MAXN and 25W) and regardless of whether jetson_clocks is on or off.

At the moment we lose the connection, the syslog file shows this:

Mar  5 14:30:08 orin-nx-2 kernel: [16614.780047] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Mar  5 14:30:08 orin-nx-2 kernel: [16614.780057] rcu:   0-...0: (0 ticks this GP) idle=db1/1/0x4000000000000002 softirq=2813838/2813838 fqs=9346 
Mar  5 14:30:08 orin-nx-2 kernel: [16614.780066]        (detected by 4, t=21007 jiffies, g=4972141, q=16782)
Mar  5 14:30:08 orin-nx-2 kernel: [16614.780069] Task dump for CPU 0:
Mar  5 14:30:08 orin-nx-2 kernel: [16614.780071] task:cuda-EvtHandlr  state:R  running task     stack:    0 pid:577067 ppid:575844 flags:0x00000806
Mar  5 14:30:08 orin-nx-2 kernel: [16614.780078] Call trace:
Mar  5 14:30:08 orin-nx-2 kernel: [16614.780080]  __switch_to+0x104/0x160
Mar  5 14:30:08 orin-nx-2 kernel: [16614.780092]  0xffff98000c20
Mar  5 14:30:20 orin-nx-2 kernel: [16627.079686] nvme nvme0: I/O 10 QID 8 timeout, completion polled
Mar  5 14:30:21 orin-nx-2 kernel: [16627.395642] nvme nvme0: I/O 0 QID 4 timeout, completion polled
Mar  5 14:30:51 orin-nx-2 kernel: [16657.282582] nvme nvme0: I/O 125 QID 5 timeout, completion polled
Mar  5 14:30:51 orin-nx-2 kernel: [16657.282986] nvme nvme0: I/O 11 QID 8 timeout, completion polled
Mar  5 14:30:51 orin-nx-2 kernel: [16657.602601] nvme nvme0: I/O 61 QID 6 timeout, completion polled

These messages repeat in a loop until reboot.

We do not know what those messages mean or what we could try to debug the issue. Any help will be much appreciated.

Thanks,

Alex

Hi,

The error seems to be related to the CPU stall.
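
To make the stall report easier to read, here is a minimal Python sketch that pulls the stalled CPU, the detecting CPU, the jiffies counter and the task name out of the lines posted above. It is not an NVIDIA tool: the function name and the regular expressions are assumptions written against the log format seen in this thread only, and other kernel versions may print the report differently.

```python
import re

# Lines copied from the first syslog excerpt in this thread.
SAMPLE = """\
Mar  5 14:30:08 orin-nx-2 kernel: [16614.780047] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Mar  5 14:30:08 orin-nx-2 kernel: [16614.780057] rcu:   0-...0: (0 ticks this GP) idle=db1/1/0x4000000000000002 softirq=2813838/2813838 fqs=9346
Mar  5 14:30:08 orin-nx-2 kernel: [16614.780066]        (detected by 4, t=21007 jiffies, g=4972141, q=16782)
Mar  5 14:30:08 orin-nx-2 kernel: [16614.780069] Task dump for CPU 0:
Mar  5 14:30:08 orin-nx-2 kernel: [16614.780071] task:cuda-EvtHandlr  state:R  running task     stack:    0 pid:577067 ppid:575844 flags:0x00000806
"""

# Best-effort patterns for the format seen in this thread (an assumption).
STALLED_CPU = re.compile(r"rcu:\s+(\d+)-\S+: \(")
DETECTED_BY = re.compile(r"\(detected by (\d+), t=(\d+) jiffies, g=(\d+), q=(\d+)\)")
TASK = re.compile(r"task:(\S+)\s+state:(\S)\s.*?pid:\s*(\d+)")


def parse_stall(text):
    """Collect the main fields of one RCU stall report into a dict."""
    info = {}
    for line in text.splitlines():
        m = STALLED_CPU.search(line)
        if m:
            info["stalled_cpu"] = int(m.group(1))
        m = DETECTED_BY.search(line)
        if m:
            info["detected_by_cpu"] = int(m.group(1))
            # "t=" is the number of jiffies since the grace period started.
            info["jiffies_since_gp_start"] = int(m.group(2))
        m = TASK.search(line)
        if m:
            info["task"] = m.group(1)
            info["state"] = m.group(2)
            info["pid"] = int(m.group(3))
    return info


print(parse_stall(SAMPLE))
```

For the excerpt above this reports CPU 0 as the stalled CPU, detected by CPU 4, with cuda-EvtHandlr (PID 577067) as the task that was running on it.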

Do you have other CPU jobs at the same time?
Is the CPU also fully occupied?

Thanks.

There are some CPU processes but overall the device doesn’t look overloaded.
This is a screenshot of jtop just before the connection was lost.

Hi,

We will need to reproduce this issue internally to gather more info.
Could you share the detailed steps to reproduce the issue?

Do you also have an Orin NX devkit?
If yes, could you also test this on our devkit to see if the same issue occurs?

Thanks.

Hi!

Unfortunately, we do not have an Orin NX devkit.
Our current setup cannot be shared, so we will try to create a minimal shareable setup that reproduces the issue. We will come back to you as soon as possible.

Thanks,

Alex

Is this still an issue that needs support? Are there any results you can share?

We are attempting to reproduce the issue with a setup that only involves the Triton Inference Server and perf_analyzer. Although we are still encountering network loss on the device, we have not observed the rcu_preempt issue so far.

It’s possible that the network loss is unrelated to the original CPU problem, but the logs we get when this happens using exclusively NVIDIA tools are not very informative, at least those from syslog.

Would you have any suggestions on which logs to look at, or what could be enabled on the device to gather more in-depth information about what happens at the moment the network is lost?

Hi,

Are you able to provide more info about the network loss?
Is there any log or error in syslog or dmesg?

We will need more info, or a description of your setup, to find the corresponding team to help
(as the problem is not related to the GPU anymore).

Thanks.

We recently received an NVIDIA DevKit with the Jetson Orin NX 16GB module for testing purposes and encountered the same network issue.

At certain moments, we lose network connectivity. Attached are the relevant logs captured during one of these disconnections (the disconnection happens around 11:35 and the reconnection at 11:46, after what looks like a system reboot).

The device is running a set of services via Docker Compose. During the network outages, we are unable to access these services externally. Additionally, logs from the containers indicate that inter-service communication within the device also stops during these periods, suggesting that the issue affects both external and internal networking.

Can you provide any guidance on how to further diagnose and resolve this issue?

Thanks!

Alex

Apr 12 11:34:56 ubuntu kernel: [ 3498.589580] tegra-hda 3510000.hda: azx_get_response timeout, switching to polling mode: last cmd=0x004f0900
Apr 12 11:35:13 ubuntu kernel: [ 3505.513365] ---- syncpts ----
Apr 12 11:35:13 ubuntu kernel: [ 3505.513377] id 0 (reserved) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513381] id 1 (1-pva_syncpt) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513384] id 2 (2-pva_syncpt) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513388] id 3 (3-pva_syncpt) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513391] id 4 (4-pva_syncpt) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513394] id 5 (5-pva_syncpt) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513397] id 6 (6-pva_syncpt) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513399] id 7 (7-pva_syncpt) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513402] id 8 (8-pva_syncpt) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513406] id 9 (9-15340000.vic) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513409] id 10 (10-15480000.nvdec) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513412] id 11 (11-154c0000.nvenc) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513416] id 12 (12-15380000.nvjpg) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513419] id 13 (13-15540000.nvjpg) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513422] id 14 (14-15a50000.ofa) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513426] id 15 (15-ga10b_511_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513429] id 16 (16-ga10b_510_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513432] id 17 (17-ga10b_509_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513435] id 18 (18-ga10b_508_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513437] id 19 (19-ga10b_507_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513439] id 20 (20-ga10b_506_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513442] id 21 (21-ga10b_505_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513444] id 22 (22-ga10b_504_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513446] id 23 (23-ga10b_503_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513448] id 24 (24-ga10b_502_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513451] id 25 (25-ga10b_501_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513453] id 26 (26-ga10b_500_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513455] id 27 (27-ga10b_499_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513458] id 28 (28-ga10b_498_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513459] id 29 (29-ga10b_497_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513462] id 30 (30-ga10b_496_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513464] id 31 (31-queue3:src) min 309786 max 309786 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513467] id 32 (32-nvv4l2decoder0:) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513469] id 33 (33-ga10b_495_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513471] id 34 (34-ga10b_494_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513474] id 35 (35-ga10b_493_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513476] id 36 (36-ga10b_492_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513478] id 37 (37-ga10b_491_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513480] id 38 (38-ga10b_490_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513482] id 39 (39-ga10b_489_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513485] id 40 (40-ga10b_488_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513487] id 41 (41-ga10b_487_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513489] id 42 (42-ga10b_486_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513491] id 43 (43-ga10b_485_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513493] id 44 (44-ga10b_484_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513495] id 45 (45-ga10b_483_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513497] id 46 (46-ga10b_482_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513499] id 47 (47-ga10b_481_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513501] id 48 (48-ga10b_480_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513505] id 49 (49-nvv4l2decoder0:) min 157512 max 157512 (2 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513508] id 50 (50-queue3:src) min 309735 max 309735 (1 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513510] id 51 (51-nvv4l2decoder0:) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513512] id 52 (52-ga10b_479_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513515] id 53 (53-ga10b_478_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513517] id 54 (54-ga10b_477_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513519] id 55 (55-ga10b_476_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513521] id 56 (56-ga10b_475_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513524] id 57 (57-ga10b_474_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513526] id 58 (58-ga10b_473_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513528] id 59 (59-ga10b_472_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513530] id 60 (60-nvv4l2decoder1:) min 224102 max 224102 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513532] id 61 (61-nvv4l2decoder2:) min 320036 max 320036 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513534] id 62 (62-ga10b_471_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513536] id 63 (63-ga10b_470_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513538] id 64 (64-ga10b_469_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513540] id 65 (65-ga10b_468_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513542] id 66 (66-ga10b_467_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513545] id 67 (67-ga10b_466_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513547] id 68 (68-ga10b_465_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513549] id 69 (69-ga10b_464_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513552] id 70 (70-nvv4l2decoder0:) min 157444 max 157444 (2 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513554] id 71 (71-nvv4l2decoder0:) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513557] id 72 (72-nvv4l2decoder0:) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513559] id 73 (73-ga10b_463_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513561] id 74 (74-ga10b_462_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513564] id 75 (75-ga10b_461_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513566] id 76 (76-ga10b_460_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513568] id 77 (77-ga10b_459_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513570] id 78 (78-ga10b_458_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513572] id 79 (79-ga10b_457_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513574] id 80 (80-ga10b_456_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513576] id 81 (81-ga10b_455_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513578] id 82 (82-ga10b_454_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513580] id 83 (83-ga10b_453_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513583] id 84 (84-ga10b_452_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513585] id 85 (85-ga10b_451_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513587] id 86 (86-ga10b_450_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513589] id 87 (87-ga10b_449_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513591] id 88 (88-ga10b_448_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513593] id 89 (89-ga10b_447_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513595] id 90 (90-ga10b_446_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513597] id 91 (91-ga10b_445_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513600] id 92 (92-ga10b_444_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513602] id 93 (93-ga10b_443_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513604] id 94 (94-ga10b_442_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513606] id 95 (95-ga10b_441_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513608] id 96 (96-ga10b_440_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513610] id 97 (97-ga10b_439_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513612] id 98 (98-ga10b_438_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513614] id 99 (99-ga10b_437_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513617] id 100 (100-ga10b_436_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513619] id 101 (101-ga10b_435_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513621] id 102 (102-ga10b_434_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513623] id 103 (103-ga10b_433_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513625] id 104 (104-ga10b_432_user) min 0 max 0 (0 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513628] id 105 (105-nvv4l2decoder0:) min 157412 max 157412 (2 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3505.513631] id 106 (106-nvv4l2decoder0:) min 157416 max 157416 (2 waiters)
Apr 12 11:35:13 ubuntu kernel: [ 3514.729062] ------------[ cut here ]------------
Apr 12 11:35:13 ubuntu kernel: [ 3514.729067] NETDEV WATCHDOG: enP8p1s0 (r8168): transmit queue 0 timed out
Apr 12 11:35:13 ubuntu kernel: [ 3514.729085] WARNING: CPU: 7 PID: 0 at net/sched/sch_generic.c:477 dev_watchdog+0x3ac/0x3d0
Apr 12 11:35:13 ubuntu kernel: [ 3514.729099] Modules linked in: veth xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c xt_addrtype iptable_filter br_netfilter lzo_rle lzo_compress zram zsmalloc nvme_fabrics ramoops reed_solomon bridge stp llc usb_f_ncm usb_f_mass_storage algif_hash usb_f_acm u_serial algif_skcipher af_alg bnep usb_f_rndis u_ether libcomposite snd_soc_tegra186_asrc(O) snd_soc_tegra210_ope(O) snd_soc_tegra186_dspk(O) snd_soc_tegra210_admaif(O) snd_soc_tegra210_mixer(O) snd_soc_tegra210_afc(O) snd_soc_tegra_pcm snd_soc_tegra210_mvc(O) snd_soc_tegra186_arad(O) snd_soc_tegra210_dmic(O) snd_soc_tegra210_sfc(O) snd_soc_tegra210_i2s(O) snd_soc_tegra210_amx(O) snd_soc_tegra210_adx(O) snd_soc_tegra210_ahub(O) tegra210_adma spidev nvvrs_pseq_rtc(O) rtk_btusb(O) btusb btrtl btintel btbcm tegra234_oc_event(O) crct10dif_ce snd_soc_tegra_machine_driver(O) snd_soc_tegra_utils(O) fusb301(O) snd_soc_simple_card_utils
Apr 12 11:35:13 ubuntu kernel: [ 3514.729154]  nvpmodel_clk_cap(O) mttcan(O) nvpps(O) tegra_cactmon_mc_all(O) thermal_trip_event(O) can_dev tegra234_aon(O) tegra_aconnect snd_hda_codec_hdmi rtl8822ce(O) pwm_tegra_tachometer(O) snd_hda_tegra snd_hda_codec at24 snd_hda_core cfg80211 spi_tegra114 mc_hwpm(O) r8168(O) nvidia(O) tegra_pcie_dma_test(O) nvidia_vrs_pseq(O) host1x_fence(O) tegra_pcie_edma(O) tegra_dce(O) tsecriscv(O) nvhost_isp5(O) nvhost_vi5(O) nvhost_nvcsi_t194(O) tegra_camera(O) v4l2_dv_timings nvhost_nvcsi(O) tegra_camera_platform(O) capture_ivc(O) tegra_camera_rtcpu(O) ivc_bus(O) hsp_mailbox_client(O) ivc_ext(O) governor_userspace v4l2_fwnode v4l2_async tegra_drm(O) videobuf2_dma_contig nvhost_pva(O) nvhost_nvdla(O) tegra_wmark(O) nvhwpm(O) videobuf2_memops nvhost_capture(O) videobuf2_v4l2 videobuf2_common cec videodev mc host1x_nvhost(O) drm_kms_helper nvidia_p2p(O) ina3221 nvgpu(O) governor_pod_scaling(O) host1x(O) mc_utils(O) nvmap(O) nvsciipc(O) fuse drm ip_tables x_tables ipv6 pwm_fan pwm_tegra
Apr 12 11:35:13 ubuntu kernel: [ 3514.729212]  tegra_bpmp_thermal tegra_xudc ucsi_ccg typec_ucsi typec nvme nvme_core phy_tegra194_p2u pcie_tegra194
Apr 12 11:35:13 ubuntu kernel: [ 3514.729222] CPU: 7 PID: 0 Comm: swapper/7 Tainted: G           O      5.15.148-tegra #1
Apr 12 11:35:13 ubuntu kernel: [ 3514.729225] Hardware name: NVIDIA NVIDIA Jetson Orin NX Engineering Reference Developer Kit/Jetson, BIOS 36.4.3-gcid-38968081 01/08/2025
Apr 12 11:35:13 ubuntu kernel: [ 3514.729227] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
Apr 12 11:35:13 ubuntu kernel: [ 3514.729229] pc : dev_watchdog+0x3ac/0x3d0
Apr 12 11:35:13 ubuntu kernel: [ 3514.729232] lr : dev_watchdog+0x3ac/0x3d0
Apr 12 11:35:13 ubuntu kernel: [ 3514.729234] sp : ffff80000803bd40
Apr 12 11:35:13 ubuntu kernel: [ 3514.729235] x29: ffff80000803bd40 x28: 0000000000000004 x27: 0000000000000004
Apr 12 11:35:13 ubuntu kernel: [ 3514.729237] x26: ffff0000877d0c40 x25: ffff000094ca0480 x24: 0000000000000140
Apr 12 11:35:13 ubuntu kernel: [ 3514.729240] x23: ffff000094ca03dc x22: 00000000ffffffff x21: ffffdfe008576000
Apr 12 11:35:13 ubuntu kernel: [ 3514.729242] x20: ffff000094ca0000 x19: 0000000000000000 x18: ffffffffffffffff
Apr 12 11:35:13 ubuntu kernel: [ 3514.729244] x17: ffff2023e835e000 x16: ffffdfe007081920 x15: ffffdfe00898ff51
Apr 12 11:35:13 ubuntu kernel: [ 3514.729246] x14: ffffffffffffffff x13: 74756f2064656d69 x12: 7420302065756575
Apr 12 11:35:13 ubuntu kernel: [ 3514.729248] x11: 712074696d736e61 x10: 7274203a29383631 x9 : 3836313872282030
Apr 12 11:35:13 ubuntu kernel: [ 3514.729250] x8 : 73317038506e6520 x7 : 3a474f4448435441 x6 : 572056454454454e
Apr 12 11:35:13 ubuntu kernel: [ 3514.729252] x5 : ffff0003f02cb9f0 x4 : 00000000fffff4e1 x3 : ffffdfe0085fb360
Apr 12 11:35:13 ubuntu kernel: [ 3514.729254] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff00008024dd00
Apr 12 11:35:13 ubuntu kernel: [ 3514.729256] Call trace:
Apr 12 11:35:13 ubuntu kernel: [ 3514.729258]  dev_watchdog+0x3ac/0x3d0
Apr 12 11:35:13 ubuntu kernel: [ 3514.729260]  call_timer_fn+0x44/0x1f0
Apr 12 11:35:13 ubuntu kernel: [ 3514.729266]  __run_timers.part.0+0x230/0x300
Apr 12 11:35:13 ubuntu kernel: [ 3514.729269]  run_timer_softirq+0x48/0x90
Apr 12 11:35:13 ubuntu kernel: [ 3514.729271]  __do_softirq+0x130/0x3d0
Apr 12 11:35:13 ubuntu kernel: [ 3514.729274]  irq_exit+0xd0/0x100
Apr 12 11:35:13 ubuntu kernel: [ 3514.729279]  handle_domain_irq+0x78/0xc0
Apr 12 11:35:13 ubuntu kernel: [ 3514.729281]  gic_handle_irq+0x68/0x170
Apr 12 11:35:13 ubuntu kernel: [ 3514.729285]  call_on_irq_stack+0x20/0x50
Apr 12 11:35:13 ubuntu kernel: [ 3514.729288]  do_interrupt_handler+0x78/0xa0
Apr 12 11:35:13 ubuntu kernel: [ 3514.729291]  el1_interrupt+0x30/0x70
Apr 12 11:35:13 ubuntu kernel: [ 3514.729295]  el1h_64_irq_handler+0x18/0x30
Apr 12 11:35:13 ubuntu kernel: [ 3514.729297]  el1h_64_irq+0x7c/0x80
Apr 12 11:35:13 ubuntu kernel: [ 3514.729298]  cpuidle_enter_state+0xbc/0x3e0
Apr 12 11:35:13 ubuntu kernel: [ 3514.729301]  cpuidle_enter+0x44/0x70
Apr 12 11:35:13 ubuntu kernel: [ 3514.729303]  do_idle+0x228/0x2b0
Apr 12 11:35:13 ubuntu kernel: [ 3514.729306]  cpu_startup_entry+0x34/0x70
Apr 12 11:35:13 ubuntu kernel: [ 3514.729308]  secondary_start_kernel+0x14c/0x1a0
Apr 12 11:35:13 ubuntu kernel: [ 3514.729311]  __secondary_switched+0x90/0x94
Apr 12 11:35:13 ubuntu kernel: [ 3514.729314] ---[ end trace 9e9c20c30a250810 ]---
Apr 12 11:35:13 ubuntu kernel: [ 3515.734280] r8168 0008:01:00.0 enP8p1s0: Transmit timeout reset Device!
Apr 12 11:35:13 ubuntu kernel: [ 3515.781037] r8168 0008:01:00.0 enP8p1s0: Device reseting!
Apr 12 11:35:14 ubuntu kernel: [ 3516.361007] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Apr 12 11:35:14 ubuntu kernel: [ 3516.361013] rcu: 	0-...0: (1 GPs behind) idle=3f9/1/0x4000000000000002 softirq=599505/599506 fqs=2255 
Apr 12 11:35:14 ubuntu kernel: [ 3516.361021] 	(detected by 4, t=5252 jiffies, g=1078301, q=5686)
Apr 12 11:35:14 ubuntu kernel: [ 3516.361025] Task dump for CPU 0:
Apr 12 11:35:14 ubuntu kernel: [ 3516.361027] task:cuda-EvtHandlr  state:R  running task     stack:    0 pid: 5884 ppid:  5186 flags:0x00000806
Apr 12 11:35:14 ubuntu kernel: [ 3516.361034] Call trace:
Apr 12 11:35:14 ubuntu kernel: [ 3516.361035]  __switch_to+0x104/0x160
Apr 12 11:35:14 ubuntu kernel: [ 3516.361049]  0xffff9c000c20
Apr 12 11:35:27 ubuntu kernel: [ 3529.572848] nvme nvme0: I/O 719 QID 3 timeout, completion polled
Apr 12 11:35:27 ubuntu kernel: [ 3530.020781] nvme nvme0: I/O 161 QID 2 timeout, completion polled
Apr 12 11:35:57 ubuntu kernel: [ 3560.035800] nvme nvme0: I/O 733 QID 3 timeout, completion polled
Apr 12 11:36:16 ubuntu kernel: [ 3578.403105] nvme nvme0: I/O 89 QID 4 timeout, completion polled
Apr 12 11:36:17 ubuntu kernel: [ 3579.379037] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Apr 12 11:36:17 ubuntu kernel: [ 3579.379045] rcu: 	0-...0: (1 GPs behind) idle=3f9/1/0x4000000000000002 softirq=599505/599506 fqs=9193 
Apr 12 11:36:17 ubuntu kernel: [ 3579.379051] 	(detected by 5, t=21007 jiffies, g=1078301, q=20547)
Apr 12 11:36:17 ubuntu kernel: [ 3579.379054] Task dump for CPU 0:
Apr 12 11:36:17 ubuntu kernel: [ 3579.379057] task:cuda-EvtHandlr  state:R  running task     stack:    0 pid: 5884 ppid:  5186 flags:0x00000806
Apr 12 11:36:17 ubuntu kernel: [ 3579.379063] Call trace:
Apr 12 11:36:17 ubuntu kernel: [ 3579.379064]  __switch_to+0x104/0x160
Apr 12 11:36:17 ubuntu kernel: [ 3579.379079]  0xffff9c000c20
Apr 12 11:36:28 ubuntu kernel: [ 3590.242780] nvme nvme0: I/O 734 QID 3 timeout, completion polled
Apr 12 11:36:28 ubuntu kernel: [ 3590.818731] nvme nvme0: I/O 870 QID 7 timeout, completion polled
Apr 12 11:36:46 ubuntu kernel: [ 3608.418174] nvme nvme0: I/O 118 QID 4 timeout, completion polled
Apr 12 11:36:58 ubuntu kernel: [ 3620.453970] nvme nvme0: I/O 739 QID 3 timeout, completion polled
Apr 12 11:36:59 ubuntu kernel: [ 3621.537788] nvme nvme0: I/O 887 QID 7 timeout, completion polled
Apr 12 11:37:16 ubuntu kernel: [ 3638.625230] nvme nvme0: I/O 119 QID 4 timeout, completion polled
Apr 12 11:37:20 ubuntu kernel: [ 3642.397080] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Apr 12 11:37:20 ubuntu kernel: [ 3642.397087] rcu: 	0-...0: (1 GPs behind) idle=3f9/1/0x4000000000000002 softirq=599505/599506 fqs=15060 
Apr 12 11:37:20 ubuntu kernel: [ 3642.397093] 	(detected by 6, t=36762 jiffies, g=1078301, q=35158)
Apr 12 11:37:20 ubuntu kernel: [ 3642.397096] Task dump for CPU 0:
Apr 12 11:37:20 ubuntu kernel: [ 3642.397098] task:cuda-EvtHandlr  state:R  running task     stack:    0 pid: 5884 ppid:  5186 flags:0x00000806
Apr 12 11:37:20 ubuntu kernel: [ 3642.397104] Call trace:
Apr 12 11:37:20 ubuntu kernel: [ 3642.397105]  __switch_to+0x104/0x160
Apr 12 11:37:20 ubuntu kernel: [ 3642.397117]  0xffff9c000c20
Apr 12 11:37:28 ubuntu kernel: [ 3650.656862] nvme nvme0: I/O 181 QID 6 timeout, completion polled
Apr 12 11:37:33 ubuntu kernel: [ 3655.776712] nvme nvme0: I/O 839 QID 7 timeout, completion polled
Apr 12 11:37:46 ubuntu kernel: [ 3668.832344] nvme nvme0: I/O 120 QID 4 timeout, completion polled
Apr 12 11:37:51 ubuntu kernel: [ 3673.952160] nvme nvme0: I/O 740 QID 3 timeout, completion polled
Apr 12 11:37:54 ubuntu systemd[1]: systemd-logind.service: Watchdog timeout (limit 3min)!
Apr 12 11:37:54 ubuntu systemd[1]: systemd-logind.service: Killing process 538 (systemd-logind) with signal SIGABRT.
Apr 12 11:38:07 ubuntu kernel: [ 3689.311673] nvme nvme0: I/O 858 QID 7 timeout, completion polled
Apr 12 11:38:16 ubuntu kernel: [ 3699.039532] nvme nvme0: I/O 123 QID 4 timeout, completion polled
Apr 12 11:38:23 ubuntu kernel: [ 3705.415131] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Apr 12 11:38:23 ubuntu kernel: [ 3705.415139] rcu: 	0-...0: (1 GPs behind) idle=3f9/1/0x4000000000000002 softirq=599505/599506 fqs=21278 
Apr 12 11:38:23 ubuntu kernel: [ 3705.415146] 	(detected by 5, t=52517 jiffies, g=1078301, q=42714)
Apr 12 11:38:23 ubuntu kernel: [ 3705.415149] Task dump for CPU 0:
Apr 12 11:38:23 ubuntu kernel: [ 3705.415151] task:cuda-EvtHandlr  state:R  running task     stack:    0 pid: 5884 ppid:  5186 flags:0x00000806
Apr 12 11:38:23 ubuntu kernel: [ 3705.415157] Call trace:
Apr 12 11:38:23 ubuntu kernel: [ 3705.415158]  __switch_to+0x104/0x160
Apr 12 11:38:23 ubuntu kernel: [ 3705.415169]  0xffff9c000c20
Apr 12 11:38:24 ubuntu kernel: [ 3706.399187] nvme nvme0: I/O 741 QID 3 timeout, completion polled
Apr 12 11:38:48 ubuntu kernel: [ 3730.270441] nvme nvme0: I/O 124 QID 4 timeout, completion polled
Apr 12 11:38:48 ubuntu kernel: [ 3730.270464] nvme nvme0: I/O 860 QID 7 timeout, completion polled
Apr 12 11:38:54 ubuntu kernel: [ 3736.674231] nvme nvme0: I/O 743 QID 3 timeout, completion polled
Apr 12 11:38:58 ubuntu kernel: [ 3740.766098] nvme nvme0: I/O 182 QID 6 timeout, completion polled
Apr 12 11:39:18 ubuntu kernel: [ 3760.477474] nvme nvme0: I/O 124 QID 4 timeout, completion polled
Apr 12 11:39:24 ubuntu systemd[1]: systemd-logind.service: State 'stop-watchdog' timed out. Killing.
Apr 12 11:39:24 ubuntu systemd[1]: systemd-logind.service: Killing process 538 (systemd-logind) with signal SIGKILL.
Apr 12 11:39:24 ubuntu kernel: [ 3766.877278] nvme nvme0: I/O 744 QID 3 timeout, completion polled
Apr 12 11:39:26 ubuntu kernel: [ 3768.433193] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Apr 12 11:39:26 ubuntu kernel: [ 3768.433200] rcu: 	0-...0: (1 GPs behind) idle=3f9/1/0x4000000000000002 softirq=599505/599506 fqs=27504 
Apr 12 11:39:26 ubuntu kernel: [ 3768.433207] 	(detected by 5, t=68272 jiffies, g=1078301, q=47498)
Apr 12 11:39:26 ubuntu kernel: [ 3768.433210] Task dump for CPU 0:
Apr 12 11:39:26 ubuntu kernel: [ 3768.433212] task:cuda-EvtHandlr  state:R  running task     stack:    0 pid: 5884 ppid:  5186 flags:0x00000806
Apr 12 11:39:26 ubuntu kernel: [ 3768.433218] Call trace:
Apr 12 11:39:26 ubuntu kernel: [ 3768.433219]  __switch_to+0x104/0x160
Apr 12 11:39:26 ubuntu kernel: [ 3768.433231]  0xffff9c000c20
Apr 12 11:39:29 ubuntu kernel: [ 3771.489278] nvme nvme0: I/O 145 QID 6 timeout, completion polled
Apr 12 11:39:49 ubuntu kernel: [ 3791.712548] nvme nvme0: I/O 125 QID 4 timeout, completion polled
Apr 12 11:39:54 ubuntu kernel: [ 3797.084350] nvme nvme0: I/O 745 QID 3 timeout, completion polled
Apr 12 11:40:20 ubuntu kernel: [ 3822.431684] nvme nvme0: I/O 64 QID 4 timeout, completion polled
Apr 12 11:40:20 ubuntu kernel: [ 3822.431701] nvme nvme0: I/O 155 QID 6 timeout, completion polled
Apr 12 11:40:25 ubuntu kernel: [ 3827.291462] nvme nvme0: I/O 746 QID 3 timeout, completion polled
Apr 12 11:40:29 ubuntu kernel: [ 3831.451263] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Apr 12 11:40:29 ubuntu kernel: [ 3831.451270] rcu: 	0-...0: (1 GPs behind) idle=3f9/1/0x4000000000000002 softirq=599505/599506 fqs=33776 
Apr 12 11:40:29 ubuntu kernel: [ 3831.451278] 	(detected by 2, t=84027 jiffies, g=1078301, q=52226)
Apr 12 11:40:29 ubuntu kernel: [ 3831.451281] Task dump for CPU 0:
Apr 12 11:40:29 ubuntu kernel: [ 3831.451283] task:cuda-EvtHandlr  state:R  running task     stack:    0 pid: 5884 ppid:  5186 flags:0x00000806
Apr 12 11:40:29 ubuntu kernel: [ 3831.451290] Call trace:
Apr 12 11:40:29 ubuntu kernel: [ 3831.451291]  __switch_to+0x104/0x160
Apr 12 11:40:29 ubuntu kernel: [ 3831.451306]  0xffff9c000c20
Apr 12 11:40:51 ubuntu kernel: [ 3853.146667] nvme nvme0: I/O 70 QID 4 timeout, completion polled
Apr 12 11:40:51 ubuntu kernel: [ 3853.146688] nvme nvme0: I/O 158 QID 6 timeout, completion polled
Apr 12 11:40:54 ubuntu systemd[1]: systemd-logind.service: Processes still around after SIGKILL. Ignoring.
Apr 12 11:40:55 ubuntu kernel: [ 3857.502609] nvme nvme0: I/O 747 QID 3 timeout, completion polled
Apr 12 11:41:25 ubuntu kernel: [ 3887.961628] nvme nvme0: I/O 749 QID 3 timeout, completion polled
Apr 12 11:41:25 ubuntu kernel: [ 3888.013572] nvme nvme0: I/O 178 QID 6 timeout, completion polled
Apr 12 11:41:32 ubuntu kernel: [ 3894.469339] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Apr 12 11:41:32 ubuntu kernel: [ 3894.469346] rcu: 	0-...0: (1 GPs behind) idle=3f9/1/0x4000000000000002 softirq=599505/599506 fqs=40031 
Apr 12 11:41:32 ubuntu kernel: [ 3894.469353] 	(detected by 2, t=99782 jiffies, g=1078301, q=56887)
Apr 12 11:41:32 ubuntu kernel: [ 3894.469357] Task dump for CPU 0:
Apr 12 11:41:32 ubuntu kernel: [ 3894.469358] task:cuda-EvtHandlr  state:R  running task     stack:    0 pid: 5884 ppid:  5186 flags:0x00000806
Apr 12 11:41:32 ubuntu kernel: [ 3894.469365] Call trace:
Apr 12 11:41:32 ubuntu kernel: [ 3894.469366]  __switch_to+0x104/0x160
Apr 12 11:41:32 ubuntu kernel: [ 3894.469382]  0xffff9c000c20
Apr 12 11:41:56 ubuntu kernel: [ 3918.172656] nvme nvme0: I/O 750 QID 3 timeout, completion polled
Apr 12 11:41:57 ubuntu kernel: [ 3919.260752] nvme nvme0: I/O 869 QID 7 timeout, completion polled
Apr 12 11:42:01 ubuntu kernel: [ 3923.288594] nvme nvme0: I/O 132 QID 6 timeout, completion polled
Apr 12 11:42:24 ubuntu systemd[1]: systemd-logind.service: State 'final-sigterm' timed out. Killing.
Apr 12 11:42:24 ubuntu systemd[1]: systemd-logind.service: Killing process 538 (systemd-logind) with signal SIGKILL.
Apr 12 11:42:27 ubuntu kernel: [ 3949.399705] nvme nvme0: I/O 753 QID 3 timeout, completion polled
Apr 12 11:42:33 ubuntu kernel: [ 3955.287673] nvme nvme0: I/O 875 QID 7 timeout, completion polled
Apr 12 11:42:35 ubuntu kernel: [ 3957.487423] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Apr 12 11:42:35 ubuntu kernel: [ 3957.487430] rcu: 	0-...0: (1 GPs behind) idle=3f9/1/0x4000000000000002 softirq=599505/599506 fqs=45983 
Apr 12 11:42:35 ubuntu kernel: [ 3957.487438] 	(detected by 2, t=115537 jiffies, g=1078301, q=61607)
Apr 12 11:42:35 ubuntu kernel: [ 3957.487442] Task dump for CPU 0:
Apr 12 11:42:35 ubuntu kernel: [ 3957.487444] task:cuda-EvtHandlr  state:R  running task     stack:    0 pid: 5884 ppid:  5186 flags:0x00000806
Apr 12 11:42:35 ubuntu kernel: [ 3957.487451] Call trace:
Apr 12 11:42:35 ubuntu kernel: [ 3957.487452]  __switch_to+0x104/0x160
Apr 12 11:42:35 ubuntu kernel: [ 3957.487467]  0xffff9c000c20
Apr 12 11:42:58 ubuntu kernel: [ 3980.118798] nvme nvme0: I/O 757 QID 3 timeout, completion polled
Apr 12 11:42:58 ubuntu kernel: [ 3980.118841] nvme nvme0: I/O 137 QID 6 timeout, completion polled
Apr 12 11:43:03 ubuntu kernel: [ 3986.006784] nvme nvme0: I/O 885 QID 7 timeout, completion polled
Apr 12 11:46:00 ubuntu systemd-modules-load[279]: Inserted module 'nvmap'
Apr 12 11:46:00 ubuntu systemd-modules-load[279]: Inserted module 'nvgpu'
Apr 12 11:46:00 ubuntu systemd-modules-load[279]: Inserted module 'ina3221'
Apr 12 11:46:00 ubuntu systemd-modules-load[279]: Inserted module 'nvidia_p2p'
Apr 12 11:46:00 ubuntu kernel: [    0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd421]
Apr 12 11:46:00 ubuntu kernel: [    0.000000] Linux version 5.15.148-tegra (buildbrain@mobile-u64-6336-d8000) (aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot 2022.08) 11.3.0, GNU ld (GNU Binutils) 2.38) #1 SMP PREEMPT Tue Jan 7 17:14:38 PST 2025 ()
Apr 12 11:46:00 ubuntu kernel: [    0.000000] Machine model: NVIDIA Jetson Orin NX Engineering Reference Developer Kit
Apr 12 11:46:00 ubuntu kernel: [    0.000000] efi: EFI v2.70 by EDK II
Apr 12 11:46:00 ubuntu kernel: [    0.000000] efi: RTPROP=0x46d82f198 TPMFinalLog=0x45e3f0000 SMBIOS=0xffff0000 SMBIOS 3.0=0x46d220000 MEMATTR=0x467153018 ESRT=0x46718a398 TPMEventLog=0x45e408018 RNG=0x45a930018 MEMRESERVE=0x45e40ac18 
Apr 12 11:46:00 ubuntu kernel: [    0.000000] random: crng init done
Apr 12 11:46:00 ubuntu kernel: [    0.000000] secureboot: Secure boot disabled
Apr 12 11:46:00 ubuntu kernel: [    0.000000] esrt: Reserving ESRT space from 0x000000046718a398 to 0x000000046718a3d0.
Apr 12 11:46:00 ubuntu kernel: [    0.000000] Reserved memory: created CMA memory pool at 0x000000044a000000, size 256 MiB
Apr 12 11:46:00 ubuntu kernel: [    0.000000] OF: reserved mem: initialized node linux,cma, compatible id shared-dma-pool
Apr 12 11:46:00 ubuntu kernel: [    0.000000] NUMA: No NUMA configuration found

Since we can see that the device “reboots”, we thought we could get more information by enabling kernel debugging, following these instructions: Kernel Debugging Tools — NVIDIA Jetson Linux Developer Guide 1 documentation

It didn’t help. Last night the device rebooted again and no crash log was generated.

This is the log:

Apr 16 00:04:19 orin-nx-2 kernel: [25924.573599] cpufreq: cpu0,cur:246000,set:1984000,delta:1738000,set ndiv:155
Apr 16 00:04:52 orin-nx-2 kernel: [25957.605512] cpufreq: cpu0,cur:278000,set:1984000,delta:1706000,set ndiv:155
Apr 16 00:04:53 orin-nx-2 kernel: [25958.607464] cpufreq: cpu4,cur:990000,set:1984000,delta:994000,set ndiv:155
Apr 16 00:04:55 orin-nx-2 kernel: [25960.610693] cpufreq: cpu4,cur:749000,set:1984000,delta:1235000,set ndiv:155
Apr 16 00:05:06 orin-nx-2 kernel: [25971.620831] cpufreq: cpu4,cur:246000,set:1984000,delta:1738000,set ndiv:155
Apr 16 00:05:23 orin-nx-2 kernel: [25988.635968] cpufreq: cpu4,cur:1110000,set:1984000,delta:874000,set ndiv:155
Apr 16 00:05:29 orin-nx-2 kernel: [25994.641330] cpufreq: cpu4,cur:248000,set:1984000,delta:1736000,set ndiv:155
Apr 16 00:06:31 orin-nx-2 kernel: [26056.692966] cpufreq: cpu0,cur:1776000,set:1984000,delta:208000,set ndiv:155
Apr 16 00:06:59 orin-nx-2 kernel: [26084.719688] cpufreq: cpu0,cur:1728000,set:1984000,delta:256000,set ndiv:155
Apr 16 00:07:19 orin-nx-2 kernel: [26104.737757] cpufreq: cpu4,cur:1060000,set:1984000,delta:924000,set ndiv:155
Apr 16 00:16:54 orin-nx-2 systemd-modules-load[272]: Inserted module 'nvmap'
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd421]
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] Linux version 5.15.148-tegra (root@SWENG5) (aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot 2022.08) 11.3.0, GNU ld (GNU Binutils) 2.38) #1 SMP PREEMPT Fri Dec 13 12:15:46 EST 2024 ()
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] Machine model: CTI Hadron + Orin NX
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] efi: EFI v2.70 by EDK II
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] efi: RTPROP=0x46d7df198 TPMFinalLog=0x45e3a0000 SMBIOS=0xffff0000 SMBIOS 3.0=0x46d1d0000 MEMATTR=0x467112018 ESRT=0x467a7a198 TPMEventLog=0x45e3b8018 RNG=0x45a7e0018 MEMRESERVE=0x45e3bac18 
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] random: crng init done
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] secureboot: Secure boot disabled
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] esrt: Reserving ESRT space from 0x0000000467a7a198 to 0x0000000467a7a1d0.
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] Reserved memory: created CMA memory pool at 0x000000044a000000, size 256 MiB
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] OF: reserved mem: initialized node linux,cma, compatible id shared-dma-pool
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] NUMA: No NUMA configuration found
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] NUMA: Faking a node at [mem 0x0000000080000000-0x0000000477ffffff]
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] NUMA: NODE_DATA [mem 0x4702fc800-0x4702fefff]
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] Zone ranges:
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000]   DMA      [mem 0x0000000080000000-0x00000000ffffffff]
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000]   DMA32    empty
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000]   Normal   [mem 0x0000000100000000-0x0000000477ffffff]
Apr 16 00:16:54 orin-nx-2 systemd-modules-load[272]: Inserted module 'nvgpu'
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] Movable zone start for each node
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] Early memory node ranges
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000]   node   0: [mem 0x0000000080000000-0x00000000bdffffff]
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000]   node   0: [mem 0x00000000c2000000-0x00000000fffdffff]
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000]   node   0: [mem 0x00000000fffe0000-0x00000000ffffffff]
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000]   node   0: [mem 0x0000000100000000-0x000000045e1e6fff]
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000]   node   0: [mem 0x000000045e1e7000-0x000000045e3affff]
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000]   node   0: [mem 0x000000045e3b0000-0x000000045e3bafff]
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000]   node   0: [mem 0x000000045e3bb000-0x000000045e3bbfff]
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000]   node   0: [mem 0x000000045e3bc000-0x000000046b89ffff]
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000]   node   0: [mem 0x000000046b8a0000-0x000000046d7dffff]
Apr 16 00:16:54 orin-nx-2 systemd-modules-load[272]: Inserted module 'ina3221'
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000]   node   0: [mem 0x000000046d7e0000-0x0000000471dfffff]
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000]   node   0: [mem 0x0000000471e00000-0x0000000471ffffff]
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000]   node   0: [mem 0x0000000472000000-0x000000047259ffff]
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000]   node   0: [mem 0x0000000472f00000-0x0000000472ffffff]
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000]   node   0: [mem 0x0000000476000000-0x0000000477ffffff]
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] mminit::pageflags_layout_widths Section 0 Node 4 Zone 2 Lastcpupid 16 Kasantag 0 Flags 24
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] mminit::pageflags_layout_shifts Section 21 Node 4 Zone 2 Lastcpupid 16 Kasantag 0
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] mminit::pageflags_layout_pgshifts Section 0 Node 60 Zone 58 Lastcpupid 42 Kasantag 0
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] mminit::pageflags_layout_nodezoneid Node/Zone ID: 64 -> 58
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] mminit::pageflags_layout_usage location: 64 -> 42 layout 42 -> 24 unused 24 -> 0 page-flags
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] Initmem setup node 0 [mem 0x0000000080000000-0x0000000477ffffff]
Apr 16 00:16:54 orin-nx-2 systemd-modules-load[272]: Inserted module 'nvidia_p2p'
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] mminit::memmap_init Initialising map node 0 zone 0 pfns 524288 -> 1048576
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] mminit::memmap_init Initialising map node 0 zone 2 pfns 1048576 -> 4685824
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] On node 0, zone DMA: 16384 pages in unavailable ranges
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] On node 0, zone Normal: 2400 pages in unavailable ranges
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] On node 0, zone Normal: 12288 pages in unavailable ranges
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] crashkernel low memory reserved: 0xf7e00000 - 0xffe00000 (128 MB)
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] crashkernel reserved: 0x00000003ba200000 - 0x000000043a200000 (2048 MB)
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] psci: probing for conduit method from DT.
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] psci: PSCIv1.1 detected in firmware.
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] psci: Using standard PSCI v0.2 function IDs
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] psci: Trusted OS migration not required
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] psci: SMC Calling Convention v1.2
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] percpu: Embedded 29 pages/cpu s80408 r8192 d30184 u118784
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] pcpu-alloc: s80408 r8192 d30184 u118784 alloc=29*4096
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] pcpu-alloc: [0] 0 [0] 1 [0] 2 [0] 3 [0] 4 [0] 5 [0] 6 [0] 7 
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] Detected PIPT I-cache on CPU0
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] CPU features: detected: Address authentication (architected algorithm)
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] CPU features: detected: GIC system register CPU interface
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] CPU features: detected: Virtualization Host Extensions
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] CPU features: detected: Hardware dirty bit management
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] CPU features: detected: Spectre-v4
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] CPU features: detected: Spectre-BHB
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] CPU features: kernel page table isolation forced ON by KASLR
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] CPU features: detected: Kernel page table isolation (KPTI)
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] alternatives: patching kernel code
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] mminit::zonelist general 0:DMA = 0:DMA 
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] mminit::zonelist general 0:Normal = 0:Normal 0:DMA 
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] mminit::zonelist thisnode 0:DMA = 0:DMA 
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] mminit::zonelist thisnode 0:Normal = 0:Normal 0:DMA 
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 4065440
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] Policy zone: Normal
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] Kernel command line: root=PARTUUID=fdfa49c2-fcac-47d2-9f4f-2e3644bf23e7 rw rootwait rootfstype=ext4 mminit_loglevel=4 console=ttyTCU0,115200 firmware_class.path=/etc/firmware fbcon=map:0 nospectre_bhb video=efifb:off console=tty0 crashkernel=2G bl_prof_dataptr=2031616@0x471E10000 bl_prof_ro_ptr=65536@0x471E00000 
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] Unknown kernel command line parameters "bl_prof_dataptr=2031616@0x471E10000 bl_prof_ro_ptr=65536@0x471E00000", will be passed to user space.
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] Dentry cache hash table entries: 2097152 (order: 12, 16777216 bytes, linear)
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] Inode-cache hash table entries: 1048576 (order: 11, 8388608 bytes, linear)
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] software IO TLB: mapped [mem 0x00000000f3e00000-0x00000000f7e00000] (64MB)
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] Memory: 13516768K/16521856K available (19712K kernel code, 4088K rwdata, 10120K rodata, 7744K init, 697K bss, 2742944K reserved, 262144K cma-reserved)
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=8, Nodes=1
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] trace event string verifier disabled
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] rcu: Preemptible hierarchical RCU implementation.
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] rcu:   RCU event tracing is enabled.
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] rcu:   RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=8.
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000]        Trampoline variant of Tasks RCU enabled.
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000]        Rude variant of Tasks RCU enabled.
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000]        Tracing variant of Tasks RCU enabled.
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=8
Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0

In the log we can see:

Apr 16 00:16:54 orin-nx-2 kernel: [    0.000000] crashkernel reserved: 0x00000003ba200000 - 0x000000043a200000 (2048 MB)

which shows that memory for the crash kernel is reserved, so crash logging should be enabled. What could cause the system to reboot without a kernel crash?
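
To put a number on how long the device was silent before each reboot, here is a minimal sketch that scans a syslog-style file for the point where the kernel uptime counter in the square brackets drops back towards zero. It is only a heuristic, not something we know to be the recommended way to diagnose this. The function name `find_reboots` and the default `year=2025` are assumptions made for illustration (the syslog timestamps carry no year).

```python
import re
import sys
from datetime import datetime

# Matches kernel lines such as:
#   Apr 16 00:07:19 orin-nx-2 kernel: [26104.737757] cpufreq: ...
KERNEL_RE = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: \[\s*(\d+\.\d+)\]"
)


def find_reboots(lines, year=2025):
    """Return (last_seen, first_boot_line, gap_seconds) for each reboot.

    A reboot is assumed whenever the kernel uptime counter decreases
    between two consecutive kernel lines. `year` is an assumption,
    because the syslog timestamps do not include one.
    """
    reboots = []
    prev = None  # (wall-clock time, uptime) of the previous kernel line
    for line in lines:
        m = KERNEL_RE.match(line)
        if not m:
            continue  # skip non-kernel lines (systemd, etc.)
        wall = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
        uptime = float(m.group(2))
        if prev is not None and uptime < prev[1]:
            gap = (wall - prev[0]).total_seconds()
            reboots.append((prev[0], wall, gap))
        prev = (wall, uptime)
    return reboots


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_reboots.py <path-to-syslog>
    with open(sys.argv[1], errors="replace") as f:
        for last_seen, booted, gap in find_reboots(f):
            print(f"last kernel line {last_seen}, boot logged {booted}, gap {gap:.0f}s")
```

The gap is measured between the last kernel line before the reset and the first line of the new boot, so it includes the boot time itself and whatever time passed before the boot messages were written out.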

This might be completely unrelated, but it seems worth mentioning in case it is not. Before the reboot, we have thousands of cpufreq messages for cpu0 and cpu4, setting them to 1984000. After the reboot, there are none.
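
For reference, a minimal sketch of how these messages can be counted per CPU from a syslog-style file. The function name `count_cpufreq_messages` is made up for illustration, and the regular expression assumes the message format shown in the log above.

```python
import re
import sys
from collections import Counter

# Matches messages such as:
#   cpufreq: cpu0,cur:246000,set:1984000,delta:1738000,set ndiv:155
CPUFREQ_RE = re.compile(r"cpufreq: cpu(\d+),cur:(\d+),set:(\d+),")


def count_cpufreq_messages(lines):
    """Return a Counter mapping (cpu, set_value) -> number of messages."""
    counts = Counter()
    for line in lines:
        m = CPUFREQ_RE.search(line)
        if m:
            cpu = int(m.group(1))
            set_value = int(m.group(3))  # the "set:" field of the message
            counts[(cpu, set_value)] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python count_cpufreq.py <path-to-syslog>
    with open(sys.argv[1], errors="replace") as f:
        for (cpu, set_value), n in sorted(count_cpufreq_messages(f).items()):
            print(f"cpu{cpu} set:{set_value} -> {n} messages")
```

Lines that do not match the pattern, such as the "cpufreq transition table exceeds PAGE_SIZE" message, are simply ignored.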

There is this message further up in the log:

Apr 15 16:57:25 orin-nx-2 kernel: [  311.852908] cpufreq transition table exceeds PAGE_SIZE. Disabling

Thanks for your help,

Alex

Hi,

Thanks for sharing the logs.

Could you share a sample or the steps to reproduce this issue so we can try it internally?
In your experiments, does the rcu_preempt error happen once the network is lost?

Thanks.