Tegra channel error recovering on Xavier when capturing with v4l2

ACervantes · June 5, 2020, 3:48pm

Hi all,

I created a driver for Xavier, and it captures the incoming buffers well. However, I see unexpected behavior when I try to capture while no buffers are arriving at the CSI port. I am using v4l2-ctl to capture the stream.

Basically, if no buffers arrive at the CSI, the capture subsystem crashes, v4l2-ctl hangs, and the Xavier needs to be rebooted.

I would expect the capture subsystem to handle this kind of situation. Here is the log from when the issue happens:

[62376.103061] tegra194-vi5 15c10000.vi: no reply from camera processor
[62376.103225] tegra194-vi5 15c10000.vi: uncorr_err: request timed out after 2500 ms
[62376.103411] tegra194-vi5 15c10000.vi: err_rec: attempting to reset the capture channel
[62376.106113] tegra194-vi5 15c10000.vi: err_rec: successfully reset the capture channel
[62378.663060] tegra194-vi5 15c10000.vi: no reply from camera processor
[62378.663249] tegra194-vi5 15c10000.vi: uncorr_err: request timed out after 2500 ms
[62378.663441] tegra194-vi5 15c10000.vi: err_rec: attempting to reset the capture channel
[62378.663627] tegra194-vi5 15c10000.vi: unexpected response from camera processor
[62378.663761] video4linux video0: vi capture release failed
[62378.663889] tegra194-vi5 15c10000.vi: fatal: error recovery failed
[62467.405499] Unable to handle kernel NULL pointer dereference at virtual address 00000000
[62467.405651] Mem abort info:
[62467.405703]   ESR = 0x96000005
[62467.405757]   Exception class = DABT (current EL), IL = 32 bits
[62467.405852]   SET = 0, FnV = 0
[62467.405908]   EA = 0, S1PTW = 0
[62467.405964] Data abort info:
[62467.406016]   ISV = 0, ISS = 0x00000005
[62467.406079]   CM = 0, WnR = 0
[62467.406146] user pgtable: 4k pages, 39-bit VAs, pgd = ffffffc335ef4000
[62467.406257] [0000000000000000] *pgd=0000000000000000, *pud=0000000000000000
[62467.406405] Internal error: Oops: 96000005 [#1] PREEMPT SMP
[62467.406501] Modules linked in: ar_camera_agnostic ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bnep fuse zram overlay b53_mdio b53_common dsa_core spidev nvgpu bluedroid_pm ip_tables x_tables
[62467.408239] CPU: 1 PID: 1178 Comm: v4l2-ctl Not tainted 4.9.140tegra #32
[62467.408729] Hardware name: NVIDIA Jetson AGX Xavier 8GB Developer Kit (DT)
[62467.415917] task: ffffffc36a740e00 task.stack: ffffffc33a648000
[62467.421522] PC is at exit_creds+0x2c/0x80
[62467.425961] LR is at __put_task_struct+0x4c/0x148
[62467.430333] pc : [<ffffff80080deba4>] lr : [<ffffff80080b069c>] pstate: 60400045
[62467.437765] sp : ffffffc33a64ba10
[62467.441003] x29: ffffffc33a64ba10 x28: ffffffc36a740e00 
[62467.446514] x27: 0000000000000009 x26: ffffffc393038cc0 
[62467.452288] x25: ffffffc33a762310 x24: 0000000000000001 
[62467.457539] x23: ffffffc3b7c62048 x22: ffffffc3d9ee2b58 
[62467.462700] x21: ffffffc3ea346230 x20: 0000000000000000 
[62467.467864] x19: ffffffc3ea346200 x18: 0000000000000046 
[62467.473552] x17: 0000007f832d1f60 x16: ffffffbf0d176120 
[62467.479326] x15: 0000000000033800 x14: 0000000000000001 
[62467.484928] x13: 0000000000000551 x12: 0000000000000612 
[62467.490440] x11: 000000000000000b x10: 0000000000000a20 
[62467.496219] x9 : ffffffc33a64b850 x8 : ffffffc36a741880 
[62467.501990] x7 : fefefeff646c606d x6 : 000001c00c9e5e40 
[62467.507502] x5 : 0000000000000800 x4 : 0000000000000000 
[62467.512841] x3 : 00000000000000d8 x2 : 0000000000000000 
[62467.518177] x1 : 0000000000000000 x0 : 00000000ffffffff 

[62467.524938] Process v4l2-ctl (pid: 1178, stack limit = 0xffffffc33a648000)
[62467.531571] Call trace:
[62467.533949] [<ffffff80080deba4>] exit_creds+0x2c/0x80
[62467.538751] [<ffffff80080b069c>] __put_task_struct+0x4c/0x148
[62467.544092] [<ffffff80080dc6b0>] kthread_stop+0x1e0/0x1e8
[62467.548910] [<ffffff8008b20a18>] vi5_channel_stop_kthreads+0x40/0x58
[62467.555286] [<ffffff8008b20ac0>] vi5_channel_stop_streaming+0x90/0xb0
[62467.561158] [<ffffff8008b139a4>] tegra_channel_stop_streaming+0x34/0x48
[62467.567626] [<ffffff8008b0be04>] __vb2_queue_cancel+0x3c/0x170
[62467.573222] [<ffffff8008b0d22c>] vb2_core_queue_release+0x2c/0x58
[62467.579082] [<ffffff8008b0f840>] _vb2_fop_release+0x88/0xa8
[62467.584420] [<ffffff8008b15848>] tegra_channel_close+0x58/0x128
[62467.590109] [<ffffff8008ae989c>] v4l2_release+0x4c/0x98
[62467.595184] [<ffffff800825a8fc>] __fput+0x94/0x1d0
[62467.600000] [<ffffff800825aab0>] ____fput+0x20/0x30
[62467.604828] [<ffffff80080d9840>] task_work_run+0xc0/0xe0
[62467.610341] [<ffffff80080b98d0>] do_exit+0x2b8/0x9d8
[62467.614968] [<ffffff80080ba07c>] do_group_exit+0x3c/0xa0
[62467.620560] [<ffffff80080c76b4>] get_signal+0x29c/0x590
[62467.625735] [<ffffff800808b0c8>] do_signal+0x168/0x4e0
[62467.630974] [<ffffff800808b5b8>] do_notify_resume+0x90/0xb0
[62467.636317] [<ffffff8008083754>] work_pending+0x8/0x10
[62467.641391] ---[ end trace e71e96af6eea7fb3 ]---
[62467.654271] Fixing recursive fault but reboot is needed!
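
For readers following the trace, it can help to read it from the bottom up. The Python sketch below is only an illustration (the helper name `call_chain` and the shortened `SAMPLE` excerpt are assumptions, not part of any NVIDIA tool): it pulls the function names out of the call-trace lines and prints them from the outermost caller to the faulting function. On this log, the chain runs from signal delivery through process exit and the V4L2 release path into `kthread_stop`, which suggests the oops is hit while v4l2-ctl is being torn down after the failed recovery.

```python
import re

# Matches call-trace entries such as:
#   [62467.533949] [<ffffff80080deba4>] exit_creds+0x2c/0x80
# The "pc : [<...>] lr : [<...>]" register line has no symbol+offset part,
# so it is not matched.
TRACE_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def call_chain(oops_text):
    """Return function names ordered from outermost caller to faulting function."""
    names = [m.group(1) for m in TRACE_RE.finditer(oops_text)]
    # The kernel prints the innermost frame first, so reverse the list.
    return list(reversed(names))


# Shortened excerpt of the call trace above (some frames omitted for brevity).
SAMPLE = """\
[62467.533949] [<ffffff80080deba4>] exit_creds+0x2c/0x80
[62467.538751] [<ffffff80080b069c>] __put_task_struct+0x4c/0x148
[62467.544092] [<ffffff80080dc6b0>] kthread_stop+0x1e0/0x1e8
[62467.548910] [<ffffff8008b20a18>] vi5_channel_stop_kthreads+0x40/0x58
[62467.555286] [<ffffff8008b20ac0>] vi5_channel_stop_streaming+0x90/0xb0
[62467.584420] [<ffffff8008b15848>] tegra_channel_close+0x58/0x128
[62467.610341] [<ffffff80080b98d0>] do_exit+0x2b8/0x9d8
[62467.620560] [<ffffff80080c76b4>] get_signal+0x29c/0x590
"""

chain = call_chain(SAMPLE)
print(" -> ".join(chain))
```

The same helper can be fed the full `dmesg` output; lines that are not call-trace entries are simply ignored.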

I think the TX2 no longer has this issue, but that is because the TX2 uses vi4 instead of vi5, and vi4 handles this kind of problem.

Sometimes we need the system to keep trying to capture even if no buffers are arriving, or at least not to crash and require a reboot.

Do you have any ideas on how we could handle this situation?

Thanks,
-Adrian

JerryChang · June 8, 2020, 2:10am

hello ACervantes,

it’s true that TX2 and Xavier use different VI drivers: TX2 works with VI-4 and Xavier uses VI-5.

as you can see, the software recovery mechanism was triggered here.

[62376.103411] tegra194-vi5 15c10000.vi: err_rec: attempting to reset the capture channel
[62376.106113] tegra194-vi5 15c10000.vi: err_rec: successfully reset the capture channel
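
To see how each recovery attempt ended, the log can be scanned for the `err_rec` messages. The Python sketch below is a rough illustration only: the function name `recovery_outcomes` and the outcome labels are made up for this example, and the matching relies solely on the message strings that appear in the log posted above. On that log it reports that the first reset succeeded and the second one ended in the fatal error.

```python
import re

# Kernel log lines look like "[62376.103411] tegra194-vi5 15c10000.vi: ...".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

ATTEMPT = "err_rec: attempting to reset the capture channel"
SUCCESS = "err_rec: successfully reset the capture channel"
FATAL = "fatal: error recovery failed"


def recovery_outcomes(log_text):
    """Return one (timestamp, outcome) pair per reset attempt found in the log."""
    outcomes = []
    pending = None  # timestamp of an attempt whose result has not been seen yet
    for raw in log_text.splitlines():
        m = LINE_RE.match(raw.strip())
        if not m:
            continue
        ts, msg = float(m.group(1)), m.group(2)
        if ATTEMPT in msg:
            if pending is not None:
                outcomes.append((pending, "unknown"))
            pending = ts
        elif SUCCESS in msg and pending is not None:
            outcomes.append((pending, "recovered"))
            pending = None
        elif FATAL in msg and pending is not None:
            outcomes.append((pending, "failed"))
            pending = None
    if pending is not None:
        outcomes.append((pending, "unknown"))
    return outcomes


# Lines copied from the log in the first post.
SAMPLE = """\
[62376.103411] tegra194-vi5 15c10000.vi: err_rec: attempting to reset the capture channel
[62376.106113] tegra194-vi5 15c10000.vi: err_rec: successfully reset the capture channel
[62378.663060] tegra194-vi5 15c10000.vi: no reply from camera processor
[62378.663249] tegra194-vi5 15c10000.vi: uncorr_err: request timed out after 2500 ms
[62378.663441] tegra194-vi5 15c10000.vi: err_rec: attempting to reset the capture channel
[62378.663627] tegra194-vi5 15c10000.vi: unexpected response from camera processor
[62378.663761] video4linux video0: vi capture release failed
[62378.663889] tegra194-vi5 15c10000.vi: fatal: error recovery failed
"""

results = recovery_outcomes(SAMPLE)
for ts, outcome in results:
    print(f"{ts:.6f}: {outcome}")
```
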

could you confirm whether you disconnected the physical signal during sensor streaming?
or, could you please share the complete steps to reproduce for reference?
thanks

ACervantes · June 11, 2020, 5:17pm

Hi JerryChang,

Thanks for your response,

I was able to reproduce the issue with the Xavier EVM kit and ov5693 sensor.

I modified the ov5693 driver slightly so that it does not write the sensor’s stream-on register, so no buffers arrive at the Xavier CSI.

I would expect the capture attempt to fail, but V4L hangs and the system needs to be rebooted once I receive the error "fatal: error recovery failed".

These are the messages that I see:

[   38.485515] tegra194-vi5 15c10000.vi: no reply from camera processor
[   38.485685] tegra194-vi5 15c10000.vi: uncorr_err: request timed out after 2500 ms
[   38.485839] tegra194-vi5 15c10000.vi: err_rec: attempting to reset the capture channel
[   38.488524] tegra194-vi5 15c10000.vi: err_rec: successfully reset the capture channel
[   41.045510] tegra194-vi5 15c10000.vi: no reply from camera processor
[   41.045678] tegra194-vi5 15c10000.vi: uncorr_err: request timed out after 2500 ms
[   41.045860] tegra194-vi5 15c10000.vi: err_rec: attempting to reset the capture channel
[   41.046021] tegra194-vi5 15c10000.vi: unexpected response from camera processor
[   41.046191] video4linux video0: vi capture release failed
[   41.046294] tegra194-vi5 15c10000.vi: fatal: error recovery failed
[   60.861883] Unable to handle kernel NULL pointer dereference at virtual address 00000000
[   60.862111] Mem abort info:
[   60.862215]   ESR = 0x96000005
[   60.862292]   Exception class = DABT (current EL), IL = 32 bits
[   60.862410]   SET = 0, FnV = 0
[   60.862477]   EA = 0, S1PTW = 0
[   60.862557] Data abort info:
[   60.862625]   ISV = 0, ISS = 0x00000005
[   60.862702]   CM = 0, WnR = 0
[   60.862774] user pgtable: 4k pages, 39-bit VAs, pgd = ffffffc3a8045000
[   60.862995] [0000000000000000] *pgd=0000000000000000, *pud=0000000000000000
[   60.863165] Internal error: Oops: 96000005 [#1] PREEMPT SMP
[   60.863274] Modules linked in: ov5693 bnep fuse zram overlay spidev nvgpu bluedroid_pm ip_tables x_tables
[   60.863622] CPU: 0 PID: 7977 Comm: v4l2-ctl Not tainted 4.9.140-tegra #1
[   60.863741] Hardware name: Jetson-AGX (DT)
[   60.863820] task: ffffffc3a819d400 task.stack: ffffffc3c5f1c000
[   60.864199] PC is at exit_creds+0x2c/0x78
[   60.864504] LR is at __put_task_struct+0x4c/0x140
[   60.864865] pc : [<ffffff80080def7c>] lr : [<ffffff80080b01ac>] pstate: 60400045
[   60.869914] sp : ffffffc3c5f1fa10
[   60.873137] x29: ffffffc3c5f1fa10 x28: 0000000000000008 
[   60.878667] x27: ffffff8008f72000 x26: ffffffc3c5f1fde8 
[   60.884267] x25: ffffffc3e13bc9e8 x24: ffffffc3d711e518 
[   60.890107] x23: 0000000000000001 x22: ffffffc3e7906018 
[   60.895532] x21: ffffffc3e0954630 x20: 0000000000000000 
[   60.900624] x19: ffffffc3e0954600 x18: 0000000009112827 
[   60.906121] x17: 0000007f8528e698 x16: 0000000000000000 
[   60.911805] x15: 0000000000000000 x14: 0000000003369153 
[   60.917248] x13: 0000007f857f67a8 x12: 0000000000000000 
[   60.922759] x11: 00000000001192d4 x10: 0000000000000a10 
[   60.928532] x9 : ffffffc3c5f1f860 x8 : 0000000000000000 
[   60.934551] x7 : ffffff8009537000 x6 : 0000000000000001 
[   60.940063] x5 : ffffff8008163740 x4 : ffffffbf0f9e4bd0 
[   60.945153] x3 : 0000000000000001 x2 : 0000000000000000 
[   60.950737] x1 : 0000000000000000 x0 : 00000000ffffffff 

[   60.957486] Process v4l2-ctl (pid: 7977, stack limit = 0xffffffc3c5f1c000)
[   60.964128] Call trace:
[   60.966499] [<ffffff80080def7c>] exit_creds+0x2c/0x78
[   60.971310] [<ffffff80080b01ac>] __put_task_struct+0x4c/0x140
[   60.976392] [<ffffff80080dca3c>] kthread_stop+0x1e4/0x1e8
[   60.981467] [<ffffff8008b48e98>] vi5_channel_stop_kthreads+0x40/0x58
[   60.987589] [<ffffff8008b48f3c>] vi5_channel_stop_streaming+0x8c/0xa8
[   60.993973] [<ffffff8008b3b8ac>] tegra_channel_stop_streaming+0x34/0x48
[   60.999927] [<ffffff8008b33bbc>] __vb2_queue_cancel+0x34/0x188
[   61.005782] [<ffffff8008b350e4>] vb2_core_queue_release+0x2c/0x58
[   61.011644] [<ffffff8008b37764>] _vb2_fop_release+0x84/0xa0
[   61.016726] [<ffffff8008b3d224>] tegra_channel_close+0x64/0x140
[   61.022415] [<ffffff8008b10d18>] v4l2_release+0x48/0xa0
[   61.027748] [<ffffff800825ef20>] __fput+0x90/0x1d0
[   61.032559] [<ffffff800825f0d8>] ____fput+0x20/0x30
[   61.037115] [<ffffff80080d9bf4>] task_work_run+0xbc/0xd8
[   61.042887] [<ffffff80080b9674>] do_exit+0x2c4/0xa08
[   61.047785] [<ffffff80080b9e48>] do_group_exit+0x40/0xa8
[   61.053122] [<ffffff80080c7744>] get_signal+0x26c/0x578
[   61.058028] [<ffffff800808b150>] do_signal+0x130/0x500
[   61.063535] [<ffffff800808b698>] do_notify_resume+0x90/0xb0
[   61.068875] [<ffffff800808379c>] work_pending+0x8/0x10
[   61.073955] ---[ end trace 90a52d796f5f18d0 ]---
[   61.094102] Fixing recursive fault but reboot is needed!
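
The timestamps in this log also carry some information. The Python sketch below (an illustration; the helper name `timestamps` is hypothetical) computes two intervals from lines copied out of the log above: the gap between the two "no reply" timeouts, and the delay between the fatal error and the oops. The first comes out at about 2.56 s, which is consistent with the 2500 ms request timeout plus a little overhead. The second is about 19.8 s; since the call trace passes through `get_signal` and `do_exit`, the oops presumably coincides with the moment v4l2-ctl was interrupted rather than with the failed recovery itself, though that is an inference from the trace.

```python
import re

# Kernel log lines start with a "[seconds.microseconds]" timestamp.
TS_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def timestamps(log_text, needle):
    """Return the timestamps (in seconds) of every log line containing needle."""
    found = []
    for line in log_text.splitlines():
        m = TS_RE.match(line.strip())
        if m and needle in m.group(2):
            found.append(float(m.group(1)))
    return found


# Lines copied from the log above.
SAMPLE = """\
[   38.485515] tegra194-vi5 15c10000.vi: no reply from camera processor
[   41.045510] tegra194-vi5 15c10000.vi: no reply from camera processor
[   41.046294] tegra194-vi5 15c10000.vi: fatal: error recovery failed
[   60.861883] Unable to handle kernel NULL pointer dereference at virtual address 00000000
"""

no_reply = timestamps(SAMPLE, "no reply from camera processor")
fatal = timestamps(SAMPLE, "fatal: error recovery failed")
oops = timestamps(SAMPLE, "NULL pointer dereference")

timeout_gap = no_reply[1] - no_reply[0]
fatal_to_oops = oops[0] - fatal[0]

print(f"gap between timeouts: {timeout_gap:.3f} s")
print(f"fatal error to oops:  {fatal_to_oops:.3f} s")
```
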

v4l hangs:

nvidia@nvidia:~$ v4l2-ctl -d /dev/video0 --set-ctrl bypass_mode=0 --stream-mmap
^C^C^C^C^C^C^C^C^C

and these are the steps to reproduce the issue:

1. Modify the ov5693 device driver, commenting out this line (in the start_streaming function) to avoid stream on:
   err = ov5693_write_table(priv, mode_table[OV5693_MODE_START_STREAM]);
2. Load the modified module:
   sudo insmod ov5693.ko
3. Run the v4l2-ctl command:
   v4l2-ctl -d /dev/video0 --set-ctrl bypass_mode=0 --stream-mmap
4. Check dmesg after trying to capture; the fatal error will appear and the v4l2-ctl command will hang.

Please let me know if you have any ideas about why the capture subsystem does not recover, fails with the fatal error, and then requires a reboot.

Thanks,
-Adrian

JerryChang · June 18, 2020, 9:24am

hello ACervantes,

we recently fixed the error paths in the vi5_channel_start_streaming() API, based on l4t-r32.4.2.
may I know which JetPack release you’re working with?
I could release a kernel patch for your testing if you’re on the latest JetPack release,
thanks

ACervantes · June 18, 2020, 6:20pm

Hi JerryChang,

We are using JP 4.4 (I think this is the latest JP you have released).

Please share the patch so we can apply it and check whether the issue is fixed.

Thank you.

JerryChang · June 19, 2020, 2:54am

hello ACervantes,

please download the attachment, June19_Topic126439.zip (3.6 KB); it contains two kernel patches for your verification.
thanks

ACervantes · June 23, 2020, 5:25pm

Hi JerryChang,

The patches worked very well, thank you!

-Adrian

JerryChang · June 24, 2020, 2:11am

hello ACervantes,

FYI, those kernel fixes have also been merged into the release code-line; please expect the next public release (i.e. JP-4.4 GA / l4t-r32.4.3) to include them.
thanks
