Hi,
Setup: Xavier NX with JetPack 5.1.3, FPD-Link III and 8 IMX219 cameras.
I found an issue where, under certain conditions, argus won’t start back up again. Here are the kernel logs:
[27671.646943] task:ViCsiHw frameSt state:D stack: 0 pid:230904 ppid: 1 flags:0x00000001
[27671.646989] Call trace:
[27671.647010] __switch_to+0xc8/0x120
[27671.647021] __schedule+0x318/0x980
[27671.647028] schedule+0x78/0x110
[27671.647056] schedule_timeout+0x2dc/0x340
[27671.647076] wait_for_completion+0x8c/0x120
[27671.647088] vi_capture_status+0xac/0x130
[27671.647095] vi_channel_ioctl+0x2c4/0x8f0
[27671.647104] __arm64_sys_ioctl+0xac/0xf0
[27671.647113] el0_svc_common.constprop.0+0x80/0x1d0
[27671.647120] do_el0_svc+0x38/0xc0
[27671.647127] el0_svc+0x1c/0x30
[27671.647133] el0_sync_handler+0xa8/0xb0
[27671.647140] el0_sync+0x16c/0x180
The kernel complains about a ‘Hung task’ for x time. This leaves the system in the following state:
user@vision:~$ ps aux | grep argus
root 230695 0.1 0.0 0 0 ? Zsl 13:15 0:05 [nvargus-daemon] <defunct>
root 253690 0.0 0.1 55684 10856 ? Ss 14:10 0:00 /usr/sbin/nvargus-daemon
user 253866 0.0 0.0 10300 664 pts/0 S+ 14:11 0:00 grep --color=auto argus
The defunct ‘nvargus-daemon’ never dies, and any attempt to restart ‘/usr/sbin/nvargus-daemon’ fails. The only way out is to reboot the board, and we are trying to avoid that at all costs.
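To confirm which thread is stuck in uninterruptible sleep without waiting for the hung-task message, the state can be read from /proc. The following is only a minimal sketch: the helper names `stat_line_state` and `task_state` are made up for illustration, and it assumes the ‘ViCsiHw frameSt’ task from the log (pid 230904) is a thread of the defunct daemon (pid 230695). The state is taken after the last ‘)’ because the thread name itself contains a space.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return the state character ('R', 'S', 'D', 'Z', ...) from one line of
 * /proc/<pid>/stat, or '\0' if the line is malformed. The comm field is
 * wrapped in parentheses and may contain spaces, so search for the last ')'.
 */
static char stat_line_state(const char *line)
{
	const char *close = strrchr(line, ')');

	if (close == NULL || close[1] != ' ' || close[2] == '\0')
		return '\0';
	return close[2];
}

/*
 * Hypothetical helper: read /proc/<pid>/task/<tid>/stat and return the
 * thread state, or '\0' if the file cannot be read.
 */
static char task_state(int pid, int tid)
{
	char path[64];
	char line[512];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/task/%d/stat", pid, tid);
	f = fopen(path, "r");
	if (f == NULL)
		return '\0';
	if (fgets(line, sizeof(line), f) == NULL)
		line[0] = '\0';
	fclose(f);
	return stat_line_state(line);
}
```

With the pids from the output above, `task_state(230695, 230904)` would be expected to report ‘D’ while the thread is blocked in the status wait, under the assumption stated in the lead-in.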
I think the following patch could solve it:
--- a/kernel/nvidia/drivers/media/platform/tegra/camera/fusa-capture/capture-vi.c
+++ b/kernel/nvidia/drivers/media/platform/tegra/camera/fusa-capture/capture-vi.c
@@ -1498,21 +1498,20 @@ int vi_capture_status(
return -ENODEV;
}
- dev_dbg(chan->dev, "%s: waiting for status, timeout:%d ms\n",
+ dev_err(chan->dev, "%s: waiting for status, timeout:%d ms\n",
__func__, timeout_ms);
/* negative timeout means wait forever */
if (timeout_ms < 0) {
- wait_for_completion(&capture->capture_resp);
- } else {
- ret = wait_for_completion_timeout(
- &capture->capture_resp,
- msecs_to_jiffies(timeout_ms));
- if (ret == 0) {
- dev_dbg(chan->dev,
- "capture status timed out\n");
- return -ETIMEDOUT;
- }
+ timeout_ms = 100;
+ }
+ ret = wait_for_completion_timeout(
+ &capture->capture_resp,
+ msecs_to_jiffies(timeout_ms));
+ if (ret == 0) {
+ dev_err(chan->dev,
+ "capture status timed out\n");
+ return -ETIMEDOUT;
}
if (ret < 0) {
Basically, the fix is to always use a timeout and never wait forever (I added 100 ms as a placeholder; I wanted to check via the print what normal values look like). I don’t really care if argus crashes because of that ‘ETIMEDOUT’, but I want to be able to restart it without leaving zombie processes behind. However, I cannot properly test the patch, since I’m not sure what issues the ‘NVHOST_VI_GET_CAPTURE_STATUS’ ioctl, which in theory calls that function (the call trace above reaches vi_capture_status through vi_channel_ioctl). With normal capture, I have not been able to get execution to enter it so far.
Does anybody know how I can trigger vi_capture_status?
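For reference, here is the intent of the patch as a small userspace sketch, so the behaviour can be checked without the camera stack. It is only an analogy: `poll()` on a pipe stands in for `wait_for_completion_timeout()`, and the names `DEFAULT_STATUS_TIMEOUT_MS`, `clamp_timeout_ms` and `wait_status_bounded` are hypothetical, not part of the driver. Like the original code, `poll()` treats a negative timeout as "wait forever", so the clamp plays the same role as in the diff.

```c
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Placeholder value, same as in the patch; not a tuned number. */
#define DEFAULT_STATUS_TIMEOUT_MS 100

/* Map a negative ("wait forever") timeout to a bounded default. */
static int clamp_timeout_ms(int timeout_ms)
{
	return timeout_ms < 0 ? DEFAULT_STATUS_TIMEOUT_MS : timeout_ms;
}

/*
 * Wait until fd becomes readable, but never without a bound.
 * Returns 0 when data is ready, -ETIMEDOUT on timeout, or a negative
 * errno on failure. A retry after EINTR restarts the full timeout,
 * which is acceptable for a sketch.
 */
static int wait_status_bounded(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int rc;

	do {
		rc = poll(&pfd, 1, clamp_timeout_ms(timeout_ms));
	} while (rc < 0 && errno == EINTR);

	if (rc == 0)
		return -ETIMEDOUT;
	if (rc < 0)
		return -errno;
	return 0;
}
```

Calling `wait_status_bounded(fd, -1)` on the read end of an empty pipe returns `-ETIMEDOUT` after roughly 100 ms instead of blocking forever; once a byte is written to the other end, the same call returns 0.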
Regards,
Andres
Embedded SW Engineer at RidgeRun
Contact us: support@ridgerun.com
Developers wiki: https://developer.ridgerun.com
Website: www.ridgerun.com