Orin AGX Bug: Workqueue Lockup When Using GPU

Hello!

We are running into an issue where we intermittently hit a kernel workqueue lockup on an Orin AGX (example log below).

  • We have seen this issue on both L4T 35.3.1 and L4T 35.6.2.
  • Increasing vm.min_free_kbytes seems to reduce the frequency of occurrence. Decreasing it to 50 MB gets the issue to happen pretty reliably after ~20 minutes; increasing it to 8 GB or 16 GB stretches the interval to every few hours but does not eliminate the lockup completely. (See the sketch after this list for how we set it.)
  • This only occurs when running applications that utilize the GPU.
  • We have unfortunately not been able to find reliable reproduction steps or a minimal example that exhibits the problem.
  • There are a few similar threads on the forums here, but none with a clear resolution.
  • When the lockup occurs, the system becomes unusable and requires a power cycle to recover.
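
For reference, here is roughly how we adjust vm.min_free_kbytes during testing (a minimal sketch; the value is in kB, so 8 GB = 8388608, and the drop-in file name is our own arbitrary choice):

  # Check the current reserve (value is in kB)
  sysctl vm.min_free_kbytes

  # Raise the reserve to 8 GB for the current boot
  sudo sysctl -w vm.min_free_kbytes=8388608

  # Persist the setting across reboots (file name is arbitrary)
  echo "vm.min_free_kbytes=8388608" | sudo tee /etc/sysctl.d/99-min-free.conf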

Any help or insight would be appreciated!

[   34.655758] Adding 2678708k swap on /dev/zram11.  Priority:5 extents:1 across:2678708k SS
[  523.231403] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 31s!
[  523.231685] Showing busy workqueues and worker pools:
[  523.231691] workqueue events: flags=0x0
[  523.231697]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=3/256 refcnt=4
[  523.231709]     pending: vmpressure_work_fn, free_work, kfree_rcu_monitor
[  523.231745] workqueue rcu_gp: flags=0x8
[  523.231748]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  523.231753]     pending: wait_rcu_exp_gp
[  523.231760] workqueue mm_percpu_wq: flags=0x8
[  523.231763]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=3/256 refcnt=6
[  523.231768]     pending: drain_local_pages_wq BAR(2626), vmstat_update, lru_add_drain_per_cpu BAR(85)
[  523.231785] workqueue pm: flags=0x4
[  523.231787]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  523.231792]     in-flight: 25:pm_runtime_work
[  523.231799] workqueue cgroup_destroy: flags=0x0
[  523.231801]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/1 refcnt=2
[  523.231806]     pending: css_release_work_fn
[  523.231853] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=31s workers=3 idle: 3908 431

This seems similar to another report with a “pm_runtime_work” in-flight workqueue item: Agx orin(jetpack5.1.2) report errors "BUG: workqueue lockup" - #8 by 1712127445

Does anyone know what pm_runtime_work is?
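
From what we can tell, pm_runtime_work is the work function the Linux runtime power-management core queues on the “pm” workqueue to carry out asynchronous runtime suspend/resume requests for a device (it lives in drivers/base/power/runtime.c), so an in-flight pm_runtime_work in the dump above would suggest some device’s runtime PM transition is not completing. One way to look for a stuck device, assuming the kernel exposes the standard sysfs power attributes, is a sketch like:

  # List devices caught mid-transition in runtime PM (a sketch)
  find /sys/devices -name runtime_status 2>/dev/null | while read -r f; do
      state=$(cat "$f" 2>/dev/null)
      case "$state" in
          suspending|resuming) echo "${f%/runtime_status}: $state" ;;
      esac
  done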

This problem is still unsolved.

Hi,

There is no conclusion on the topic you mentioned, since we don’t have a way to reproduce the issue.

We have unfortunately not been able to find reliable reproduction steps or a minimal example that exhibits the problem.

Could you help us find a way to reproduce this? Or is it only reproducible with your internal source?

Thanks.

Thanks for checking in @AastaLLL

Unfortunately, we have only been able to reproduce this when running our own internal source, so we don’t have a simple reproduction case to share.

We will try turning on the debug logging you mentioned in another thread and see if we can capture a kernel log: echo 0x20 > /sys/kernel/debug/gpu.0/log_mask
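
For reference, the rough capture sequence we plan to use (a sketch; the 0x20 mask value comes from that other thread, and gpu.0 assumes the default debugfs node for the integrated GPU):

  # Enable the extra GPU debug logging
  echo 0x20 | sudo tee /sys/kernel/debug/gpu.0/log_mask

  # Confirm the mask took effect
  sudo cat /sys/kernel/debug/gpu.0/log_mask

  # Stream kernel messages to a file until the lockup occurs
  sudo dmesg --follow | tee captured_kern.log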

Interestingly, we don’t ever see this problem on an Orin NX running the same L4T and same source code.

Thanks again,
-Pierce

Hi All (and especially @AastaLLL)

Here is a captured kernel log with the extra GPU debugging enabled:
captured_kern.log (8.1 MB)

Hopefully that helps yield some clues.

Hi,

Thanks for the update.
We will give it a check to see if there are any clues.

Are you able to share some details about the use case?
For example, what kind of CUDA kernels does your code run?
Is this a multi-threading or multi-process scenario?

Thanks.

Hi @AastaLLL
We are currently using CUDA 11.4
Yes this is a multi-threaded application, but only 1 application is using the GPU.

Thanks again!
-Pierce