Hello!
We are intermittently hitting a kernel workqueue lockup on an Orin AGX (example log below).
- We have seen this issue using both L4T 35.3.1 and L4T 35.6.2
- Increasing vm.min_free_kbytes seems to reduce how often the lockup occurs. Decreasing it to 50 MB makes the issue reproduce fairly reliably after ~20 minutes. Increasing it to 8 GB or 16 GB reduces the occurrence to once every few hours, but does not eliminate it.
- This only occurs when using applications that utilize the GPU.
- Unfortunately, we have not been able to find reliable reproduction steps or a minimal example that exhibits the problem.
- There are a few similar threads on the forums here, but none with a clear resolution.
- When the lockup occurs, the system becomes unusable and requires a power cycle to regain functionality.
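
For anyone trying to reproduce this, the sizes mentioned above have to be converted to kB before they can be written to vm.min_free_kbytes. A minimal sketch of that conversion follows, assuming binary units (1 MB = 1024 kB) and the standard /proc/sys/vm/min_free_kbytes path; the helper names `to_kbytes` and `current_min_free_kbytes` are hypothetical and only for illustration.

```python
from pathlib import Path
from typing import Optional

# Multipliers from each unit to kB (binary units; this is an assumption).
_UNITS_KB = {"KB": 1, "MB": 1024, "GB": 1024 * 1024}


def to_kbytes(size: str) -> int:
    """Convert a size such as '50MB' or '8GB' to a kB count."""
    text = size.strip().upper()
    for suffix, factor in _UNITS_KB.items():
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * factor
    raise ValueError(f"unrecognized size: {size!r}")


def current_min_free_kbytes(
    path: str = "/proc/sys/vm/min_free_kbytes",
) -> Optional[int]:
    """Return the current setting, or None if the file is not present."""
    p = Path(path)
    return int(p.read_text()) if p.exists() else None


if __name__ == "__main__":
    print("current vm.min_free_kbytes:", current_min_free_kbytes())
    # Print the sysctl commands for the values discussed above; nothing is
    # changed on the system by this script.
    for size in ("50MB", "8GB", "16GB"):
        print(f"sudo sysctl -w vm.min_free_kbytes={to_kbytes(size)}")
```

The script only prints the `sysctl -w` commands, so each value can be applied by hand between test runs.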
Any help or insight would be appreciated!
[ 34.655758] Adding 2678708k swap on /dev/zram11. Priority:5 extents:1 across:2678708k SS
[ 523.231403] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 31s!
[ 523.231685] Showing busy workqueues and worker pools:
[ 523.231691] workqueue events: flags=0x0
[ 523.231697] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=3/256 refcnt=4
[ 523.231709] pending: vmpressure_work_fn, free_work, kfree_rcu_monitor
[ 523.231745] workqueue rcu_gp: flags=0x8
[ 523.231748] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 523.231753] pending: wait_rcu_exp_gp
[ 523.231760] workqueue mm_percpu_wq: flags=0x8
[ 523.231763] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=3/256 refcnt=6
[ 523.231768] pending: drain_local_pages_wq BAR(2626), vmstat_update, lru_add_drain_per_cpu BAR(85)
[ 523.231785] workqueue pm: flags=0x4
[ 523.231787] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 523.231792] in-flight: 25:pm_runtime_work
[ 523.231799] workqueue cgroup_destroy: flags=0x0
[ 523.231801] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/1 refcnt=2
[ 523.231806] pending: css_release_work_fn
[ 523.231853] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=31s workers=3 idle: 3908 431
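
To compare lockups across runs, it may help to pull the pending work items per workqueue out of a saved log. The following is a minimal sketch, assuming the dmesg output is available as plain text in the format shown above; the function name `pending_work` is hypothetical.

```python
import re
from typing import Dict, List, Optional


def pending_work(log_text: str) -> Dict[str, List[str]]:
    """Map each workqueue name to the work functions listed as pending."""
    result: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in log_text.splitlines():
        # Drop a leading "[ 523.231691] " style timestamp if present.
        body = re.sub(r"^\[\s*\d+\.\d+\]\s*", "", line).strip()
        header = re.match(r"workqueue (\S+):", body)
        if header:
            current = header.group(1)
        elif body.startswith("pending:") and current is not None:
            items = body[len("pending:"):].split(",")
            # Strip trailing barrier annotations such as "BAR(2626)".
            names = [re.sub(r"\s+BAR\(\d+\)$", "", item.strip()) for item in items]
            result.setdefault(current, []).extend(names)
    return result


if __name__ == "__main__":
    sample = (
        "[ 523.231691] workqueue events: flags=0x0\n"
        "[ 523.231709] pending: vmpressure_work_fn, free_work, kfree_rcu_monitor\n"
    )
    for queue, works in pending_work(sample).items():
        print(queue, "->", ", ".join(works))
```

Running it over several captured logs shows whether the same work items are pending each time.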