We are running a Jetson Xavier NX system (8 GB) with JetPack 4.6.2 (kernels 4.9.253-tegra-32.7.2-20220417024839 and 4.9.299-tegra-32.7.3-20221122092958), operating at maximum performance (jetson_clocks) with the 20W power config.
Once a day or every couple of days the system simply hangs.
No SSH or console connection (via a USB-to-UART adapter or HDMI) is available, but it still responds to ping (and only to ping).
A hard reboot is necessary to bring it back to life.
Nothing special is apparent in the kernel logs: no exceptional memory reports or CPU usage.
Any hints for how to approach the problem and debug it?
Or is there a known pathological issue we should take into consideration?
How can the system freeze but still respond to ping? (Maybe that is a direction to investigate.)
If your board still responds to ping, then your UART console should still be alive. You should be able to dump the log from it and share it here.
If you think your UART console is dead, then it is possible that the console you are operating was never UART in the first place…
For example, there is nothing called "USB 2 UART". USB is USB, not UART.
Thank you for the reply.
It seems to be only a matter of time before the interfaces stop responding: even while ping still works, waiting a while and then trying to connect via UART fails with no device present. We will wait for the problem to reproduce and share more details.
In the meantime, here is kernel log output from one of the systems that starts to appear close to the time it locks up:
Dec 26 06:23:09 jetson-57 kernel: [479536.245952] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 32s!
Dec 26 06:23:09 jetson-57 kernel: [479536.246172] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 32s!
Dec 26 06:23:09 jetson-57 kernel: [479536.246352] BUG: workqueue lockup - pool cpus=3 node=0 flags=0x0 nice=0 stuck for 32s!
Dec 26 06:23:09 jetson-57 kernel: [479536.246528] BUG: workqueue lockup - pool cpus=0-5 flags=0x4 nice=0 stuck for 32s!
Dec 26 06:23:09 jetson-57 kernel: [479536.246731] Showing busy workqueues and worker pools:
Dec 26 06:23:09 jetson-57 kernel: [479536.246737] workqueue events: flags=0x0
Dec 26 06:23:09 jetson-57 kernel: [479536.246743] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256 refcnt=3
Dec 26 06:23:09 jetson-57 kernel: [479536.246764] pending: vmstat_shepherd, rtcpu_trace_worker
Dec 26 06:23:09 jetson-57 kernel: [479536.246800] workqueue events_power_efficient: flags=0x80
Dec 26 06:23:09 jetson-57 kernel: [479536.246829] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
Dec 26 06:23:09 jetson-57 kernel: [479536.246848] pending: neigh_periodic_work
Dec 26 06:23:09 jetson-57 kernel: [479536.246862] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
Dec 26 06:23:09 jetson-57 kernel: [479536.246879] pending: nf_conntrack_tuple_taken [nf_conntrack]
Dec 26 06:23:09 jetson-57 kernel: [479536.246935] workqueue writeback: flags=0x4e
Dec 26 06:23:09 jetson-57 kernel: [479536.246939] pwq 12: cpus=0-5 flags=0x4 nice=0 active=1/256 refcnt=3
Dec 26 06:23:09 jetson-57 kernel: [479536.246955] pending: wb_workfn
Dec 26 06:23:09 jetson-57 kernel: [479536.246975] workqueue devfreq_wq: flags=0x6000e
Dec 26 06:23:09 jetson-57 kernel: [479536.246979] pwq 12: cpus=0-5 flags=0x4 nice=0 active=1/1 refcnt=3
Dec 26 06:23:09 jetson-57 kernel: [479536.246993] pending: devfreq_monitor
Dec 26 06:23:09 jetson-57 kernel: [479536.247010] workqueue vmstat: flags=0xc
Dec 26 06:23:09 jetson-57 kernel: [479536.247014] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
Dec 26 06:23:09 jetson-57 kernel: [479536.247029] pending: vmstat_update
Dec 26 06:23:09 jetson-57 kernel: [479536.247041] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
Dec 26 06:23:09 jetson-57 kernel: [479536.247060] pending: vmstat_update
Beforehand, everything looks normal. From the first time this log block appears, it repeats approximately every 30 seconds, and from time to time there is a longer report:
Dec 26 06:31:20 jetson-57 kernel: [480027.761282] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 523s!
Dec 26 06:31:20 jetson-57 kernel: [480027.761510] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 523s!
Dec 26 06:31:20 jetson-57 kernel: [480027.761691] BUG: workqueue lockup - pool cpus=3 node=0 flags=0x0 nice=0 stuck for 523s!
Dec 26 06:31:20 jetson-57 kernel: [480027.761868] BUG: workqueue lockup - pool cpus=4 node=0 flags=0x0 nice=0 stuck for 517s!
Dec 26 06:31:20 jetson-57 kernel: [480027.762075] BUG: workqueue lockup - pool cpus=5 node=0 flags=0x0 nice=0 stuck for 425s!
Dec 26 06:31:20 jetson-57 kernel: [480027.762257] BUG: workqueue lockup - pool cpus=0-5 flags=0x4 nice=0 stuck for 449s!
Dec 26 06:31:20 jetson-57 kernel: [480027.762453] Showing busy workqueues and worker pools:
Dec 26 06:31:20 jetson-57 kernel: [480027.762459] workqueue events: flags=0x0
Dec 26 06:31:20 jetson-57 kernel: [480027.762465] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256 refcnt=3
Dec 26 06:31:20 jetson-57 kernel: [480027.762484] pending: vmstat_shepherd, rtcpu_trace_worker
Dec 26 06:31:20 jetson-57 kernel: [480027.762517] workqueue events_power_efficient: flags=0x80
Dec 26 06:31:20 jetson-57 kernel: [480027.762521] pwq 10: cpus=5 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
Dec 26 06:31:20 jetson-57 kernel: [480027.762537] pending: check_lifetime
Dec 26 06:31:20 jetson-57 kernel: [480027.762551] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
Dec 26 06:31:20 jetson-57 kernel: [480027.762567] pending: neigh_periodic_work
Dec 26 06:31:20 jetson-57 kernel: [480027.762582] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=2/256 refcnt=3
Dec 26 06:31:20 jetson-57 kernel: [480027.762597] pending: nf_conntrack_tuple_taken [nf_conntrack], sync_cmos_clock
Dec 26 06:31:20 jetson-57 kernel: [480027.762651] workqueue writeback: flags=0x4e
Dec 26 06:31:20 jetson-57 kernel: [480027.762682] pwq 12: cpus=0-5 flags=0x4 nice=0 active=1/256 refcnt=3
Dec 26 06:31:20 jetson-57 kernel: [480027.762697] pending: wb_workfn
Dec 26 06:31:20 jetson-57 kernel: [480027.762714] workqueue memcg_kmem_cache_create: flags=0xa0002
Dec 26 06:31:20 jetson-57 kernel: [480027.762717] pwq 12: cpus=0-5 flags=0x4 nice=0 active=1/1 refcnt=12
Dec 26 06:31:20 jetson-57 kernel: [480027.762731] pending: memcg_kmem_cache_create_func
Dec 26 06:31:20 jetson-57 kernel: [480027.762743] delayed: memcg_kmem_cache_create_func, memcg_kmem_cache_create_func, memcg_kmem_cache_create_func, memcg_kmem_cache_create_func, memcg_kmem_cache_create_func, memcg_kmem_cache_create_func, memcg_kmem_cache_create_func, memcg_kmem_cache_create_func, memcg_kmem_cache_create_func
Dec 26 06:31:20 jetson-57 kernel: [480027.762790] workqueue devfreq_wq: flags=0x6000e
Dec 26 06:31:20 jetson-57 kernel: [480027.762818] pwq 12: cpus=0-5 flags=0x4 nice=0 active=1/1 refcnt=3
Dec 26 06:31:20 jetson-57 kernel: [480027.762832] pending: devfreq_monitor
Dec 26 06:31:20 jetson-57 kernel: [480027.762849] workqueue vmstat: flags=0xc
Dec 26 06:31:20 jetson-57 kernel: [480027.762853] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
Dec 26 06:31:20 jetson-57 kernel: [480027.762868] pending: vmstat_update
Dec 26 06:31:20 jetson-57 kernel: [480027.762879] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
Dec 26 06:31:20 jetson-57 kernel: [480027.762895] pending: vmstat_update
Dec 26 06:31:20 jetson-57 kernel: [480027.762925] workqueue ipv6_addrconf: flags=0x40008
Dec 26 06:31:20 jetson-57 kernel: [480027.762930] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1 refcnt=2
Dec 26 06:31:20 jetson-57 kernel: [480027.762947] pending: addrconf_verify_work
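In case it is useful to anyone, here is a small sketch (our own diagnostic idea, nothing official) that pulls the per-pool stuck durations out of these repeating lines, to check whether the lockup ages keep growing, i.e., whether the worker pools ever recover:

```python
import re

# Matches the kernel's workqueue-lockup report lines, e.g.
# "BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 523s!"
LOCKUP_RE = re.compile(
    r"BUG: workqueue lockup - pool cpus=(?P<cpus>[\d,-]+).*?stuck for (?P<secs>\d+)s!"
)

def stuck_pools(log_lines):
    """Return {cpus: seconds}, keeping the most recent stuck duration per pool."""
    pools = {}
    for line in log_lines:
        m = LOCKUP_RE.search(line)
        if m:
            pools[m.group("cpus")] = int(m.group("secs"))
    return pools

sample = [
    "kernel: [480027.761282] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 523s!",
    "kernel: [480027.762075] BUG: workqueue lockup - pool cpus=5 node=0 flags=0x0 nice=0 stuck for 425s!",
    "kernel: [480027.762257] BUG: workqueue lockup - pool cpus=0-5 flags=0x4 nice=0 stuck for 449s!",
]
print(stuck_pools(sample))  # → {'1': 523, '5': 425, '0-5': 449}
```

Running it over our syslog shows the durations increasing monotonically until the hard reboot, so the pools never recover once the lockup begins.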
These logs do not help. They are not 100% related to the error you are reporting.
The method you are using for UART is also wrong. You should enable and monitor the UART log from the beginning, not connect the UART only after the error has happened. By that point it is too late to capture the log.
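For example, something along these lines can be left running on the host PC from boot onward (a sketch only; the device name /dev/ttyUSB0 and the 115200 baud rate are assumptions, so adjust them for your serial adapter):

```shell
# Keep a timestamped capture of the Jetson debug UART running continuously,
# so the last kernel messages before a hang are preserved on the host.
# /dev/ttyUSB0 and 115200 8N1 are assumptions -- adjust for your adapter.
stty -F /dev/ttyUSB0 115200 raw -echo
cat /dev/ttyUSB0 | while IFS= read -r line; do
    printf '%s %s\n' "$(date '+%F %T')" "$line"
done >> jetson-uart.log
```

That way, when the hang reproduces, the tail of jetson-uart.log already contains whatever the kernel managed to print.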