JetPack 5.1

I encounter the same “workqueue lockup” error described in various other topics on my Xavier with JetPack 5. In my case the fix was to disable runtime power management of the GPU (`echo on > /sys/devices/gpu.0/power/control`). My question is: is it safe to do, and will it affect the expected lifetime of the GPU? The device is expected to run constantly for 3 years, and I am wondering whether disabling power management will have an impact or not.

Hi,

I think as long as the device is placed with proper cooling, it’d be fine.
Also, doesn’t your command mean enabling rather than disabling?

Based on the docs, the choice is between “auto” and “on” (“on” keeps the device fully powered, i.e. runtime power management disabled). Cooling won’t be an issue in our case.

Thank you for the reply. In that case we will go on without power management on the GPU. By the way, do you happen to know if there is any place I can read more about how this feature is implemented (what it does) in hardware?

https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-devices-power

```
The /sys/devices/…/power/control attribute allows the user
space to control the run-time power management of the device.

All devices have one of the following two values for the
power/control file:

    + "auto\n" to allow the device to be power managed at run time;
    + "on\n" to prevent the device from being power managed;

The default for all devices is "auto", which means that they may
be subject to automatic power management, depending on their
drivers.  Changing this attribute to "on" prevents the driver
from power managing the device at run time.  Doing that while
the device is suspended causes it to be woken up.
```
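
In practice, that means the current policy can be read back and switched from a shell. A minimal sketch (the `/sys/devices/gpu.0` path comes from the command earlier in this thread and may differ on other devices; writing requires root):

```shell
# set_rpm_control: write "on" (disable runtime PM) or "auto" (enable it)
# to a device's power/control file, then read the value back to confirm.
set_rpm_control() {
    dev="$1"    # sysfs device directory, e.g. /sys/devices/gpu.0
    mode="$2"   # "on" or "auto"
    ctrl="$dev/power/control"
    if [ ! -w "$ctrl" ]; then
        echo "cannot write $ctrl (not root, or wrong path?)" >&2
        return 1
    fi
    echo "$mode" > "$ctrl"
    cat "$ctrl"   # prints the now-active policy
}

# Usage (as root):
#   set_rpm_control /sys/devices/gpu.0 on
```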

Oh sure.
I think this power management is built into Linux rather than added by NVIDIA, so you may need to do more research on how Linux handles it.
Generally, I think it will work fine unless it runs into overcurrent or overheating.
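
One more note: a sysfs write like this does not survive a reboot. One way to make it persistent (just a sketch, not an official NVIDIA recommendation; the unit name is made up) is a small systemd oneshot service:

```ini
# /etc/systemd/system/gpu-rpm-on.service (hypothetical name)
[Unit]
Description=Disable runtime power management for gpu.0

[Service]
Type=oneshot
# /bin/sh -c is needed because ExecStart= does not perform shell redirection
ExecStart=/bin/sh -c 'echo on > /sys/devices/gpu.0/power/control'

[Install]
WantedBy=multi-user.target
```

Enable it with `systemctl enable --now gpu-rpm-on.service`.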

Thank you.

Is there any generic fix for “workqueue lockup”-like problems? It looks like a few people have the same issue. We have a device with three cameras. Everything was working fine on JetPack 4, but when we moved to JetPack 5 we hit this strange behavior where a CPU is occupied 100% by some kernel thread that goes into an infinite loop.
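
A quick way to see which kernel thread is spinning (generic Linux tooling, not specific to this issue; kernel thread names show up in square brackets):

```shell
# List the busiest threads system-wide, sorted by CPU usage;
# a kthread stuck in a loop will appear near the top at ~100%.
ps -eLo pid,tid,pcpu,comm --sort=-pcpu | head -n 10
```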

Our kernel log:

workqueue lockup
date 12:05:45 host kernel: [  605.134716] Call trace:
date 12:05:45 host kernel: [  605.134736]  __switch_to+0xc8/0x120
date 12:05:45 host kernel: [  605.134776]  __schedule+0x3d0/0x910
date 12:05:45 host kernel: [  605.134788]  schedule+0x78/0x110
date 12:05:45 host kernel: [  605.134799]  schedule_timeout+0x2dc/0x340
date 12:05:45 host kernel: [  605.134809]  wait_for_completion+0x8c/0x120
date 12:05:45 host kernel: [  605.134820]  __flush_work.isra.0+0x108/0x220
date 12:05:45 host kernel: [  605.134830]  flush_work+0x24/0x30
date 12:05:45 host kernel: [  605.134869]  lru_add_drain_all+0x1a0/0x210
date 12:05:45 host kernel: [  605.134889]  khugepaged+0xa0/0x1df0
date 12:05:45 host kernel: [  605.134899]  kthread+0x148/0x170
date 12:05:45 host kernel: [  605.134909]  ret_from_fork+0x10/0x24
date 12:05:45 host kernel: [  605.134985] INFO: task gnome-shell:2962 blocked for more than 241 seconds.
date 12:05:45 host kernel: [  605.135140]       Tainted: G           OE     5.10.104-tegra #1
date 12:05:45 host kernel: [  605.135264] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
date 12:05:45 host kernel: [  605.135428] task:gnome-shell     state:D stack:    0 pid: 2962 ppid:  2932 flags:0x00000000
date 12:05:45 host kernel: [  605.135443] Call trace:
date 12:05:45 host kernel: [  605.135455]  __switch_to+0xc8/0x120
date 12:05:45 host kernel: [  605.135466]  __schedule+0x3d0/0x910
date 12:05:45 host kernel: [  605.135475]  schedule+0x78/0x110
date 12:05:45 host kernel: [  605.135487]  rpm_resume+0x160/0x750
date 12:05:45 host kernel: [  605.135497]  __pm_runtime_resume+0x44/0x90
date 12:05:45 host kernel: [  605.135765]  gk20a_busy+0x130/0x200 [nvgpu]
date 12:05:45 host kernel: [  605.136018]  nvgpu_submit_channel_gpfifo+0xf8/0x650 [nvgpu]
date 12:05:45 host kernel: [  605.136259]  nvgpu_submit_channel_gpfifo_user+0x84/0x100 [nvgpu]
date 12:05:45 host kernel: [  605.136499]  gk20a_channel_ioctl+0xee8/0x1210 [nvgpu]
date 12:05:45 host kernel: [  605.136513]  __arm64_sys_ioctl+0xac/0xf0
date 12:05:45 host kernel: [  605.136526]  el0_svc_common.constprop.0+0x80/0x1d0
date 12:05:45 host kernel: [  605.136536]  do_el0_svc+0x38/0xb0
date 12:05:45 host kernel: [  605.136546]  el0_svc+0x1c/0x30
date 12:05:45 host kernel: [  605.136555]  el0_sync_handler+0xa8/0xb0
date 12:05:45 host kernel: [  605.136564]  el0_sync+0x16c/0x180
date 12:05:45 host kernel: [  605.136581] INFO: task :3257 blocked for more than 241 seconds.
date 12:05:45 host kernel: [  605.136809]       Tainted: G           OE     5.10.104-tegra #1
date 12:05:45 host kernel: [  605.136984] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
date 12:05:45 host kernel: [  605.137184] task:app   state:D stack:    0 pid: 3257 ppid:     1 flags:0x00000000
date 12:05:45 host kernel: [  605.137199] Call trace:
date 12:05:45 host kernel: [  605.137210]  __switch_to+0xc8/0x120
date 12:05:45 host kernel: [  605.137220]  __schedule+0x3d0/0x910
date 12:05:45 host kernel: [  605.137232]  schedule+0x78/0x110
date 12:05:45 host kernel: [  605.137243]  rwsem_down_read_slowpath+0x218/0x4c0
date 12:05:45 host kernel: [  605.137253]  down_read+0xb0/0xd0
date 12:05:45 host kernel: [  605.137264]  do_page_fault+0xd8/0x400
date 12:05:45 host kernel: [  605.137274]  do_mem_abort+0x54/0xb0
date 12:05:45 host kernel: [  605.137313]  el0_ia+0x60/0xb0
date 12:05:45 host kernel: [  605.137325]  el0_sync_handler+0x90/0xb0
date 12:05:45 host kernel: [  605.137333]  el0_sync+0x16c/0x180
date 12:05:45 host kernel: [  605.137342] INFO: task :3259 blocked for more than 241 seconds.
date 12:05:45 host kernel: [  605.137551]       Tainted: G           OE     5.10.104-tegra #1
date 12:05:45 host kernel: [  605.137783] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
date 12:05:45 host kernel: [  605.137990] task:app   state:D stack:    0 pid: 3259 ppid:     1 flags:0x00000008
date 12:05:45 host kernel: [  605.138004] Call trace:
date 12:05:45 host kernel: [  605.138146]  __switch_to+0xc8/0x120
date 12:05:45 host kernel: [  605.138169]  __schedule+0x3d0/0x910
date 12:05:45 host kernel: [  605.138194]  schedule+0x78/0x110
date 12:05:45 host kernel: [  605.138247]  schedule_timeout+0x2dc/0x340
date 12:05:45 host kernel: [  605.138271]  wait_for_completion+0x8c/0x120
date 12:05:45 host kernel: [  605.138343]  __flush_work.isra.0+0x108/0x220
date 12:05:45 host kernel: [  605.138366]  flush_work+0x24/0x30
date 12:05:45 host kernel: [  605.138388]  drain_all_pages+0x17c/0x260
date 12:05:45 host kernel: [  605.138411]  __alloc_pages_slowpath.constprop.0+0x378/0xba0
date 12:05:45 host kernel: [  605.138433]  __alloc_pages_nodemask+0x2a0/0x320
date 12:05:45 host kernel: [  605.138453]  do_huge_pmd_anonymous_page+0x134/0x900
date 12:05:45 host kernel: [  605.138477]  handle_mm_fault+0x9a8/0x1000
date 12:05:45 host kernel: [  605.138499]  do_page_fault+0x118/0x400
date 12:05:45 host kernel: [  605.138543]  do_translation_fault+0x7c/0x90
date 12:05:45 host kernel: [  605.138563]  do_mem_abort+0x54/0xb0
date 12:05:45 host kernel: [  605.138581]  el0_da+0x38/0x50
date 12:05:45 host kernel: [  605.138603]  el0_sync_handler+0x80/0xb0
date 12:05:45 host kernel: [  605.138623]  el0_sync+0x16c/0x180

Please apply this patch and verify whether it resolves the issue.

0001-vi5-continue-captures-even-after-corr-errors.patch (2.3 KB)