Tegra channel error recovery on Xavier when capturing with v4l2

Hi all,

I created a driver for Xavier and it works very well when capturing incoming buffers. However, I am seeing a problem when I try to capture while no buffers are arriving at the CSI port. I am using v4l2-ctl to capture the stream.

Basically, if no buffers are arriving at the CSI port, the capture subsystem crashes, v4l2-ctl hangs, and the Xavier needs to be rebooted.

I would expect the capture subsystem to handle this kind of situation. Here is the log when the issue happens:

[62376.103061] tegra194-vi5 15c10000.vi: no reply from camera processor
[62376.103225] tegra194-vi5 15c10000.vi: uncorr_err: request timed out after 2500 ms
[62376.103411] tegra194-vi5 15c10000.vi: err_rec: attempting to reset the capture channel
[62376.106113] tegra194-vi5 15c10000.vi: err_rec: successfully reset the capture channel
[62378.663060] tegra194-vi5 15c10000.vi: no reply from camera processor
[62378.663249] tegra194-vi5 15c10000.vi: uncorr_err: request timed out after 2500 ms
[62378.663441] tegra194-vi5 15c10000.vi: err_rec: attempting to reset the capture channel
[62378.663627] tegra194-vi5 15c10000.vi: unexpected response from camera processor
[62378.663761] video4linux video0: vi capture release failed
[62378.663889] tegra194-vi5 15c10000.vi: fatal: error recovery failed
[62467.405499] Unable to handle kernel NULL pointer dereference at virtual address 00000000
[62467.405651] Mem abort info:
[62467.405703]   ESR = 0x96000005
[62467.405757]   Exception class = DABT (current EL), IL = 32 bits
[62467.405852]   SET = 0, FnV = 0
[62467.405908]   EA = 0, S1PTW = 0
[62467.405964] Data abort info:
[62467.406016]   ISV = 0, ISS = 0x00000005
[62467.406079]   CM = 0, WnR = 0
[62467.406146] user pgtable: 4k pages, 39-bit VAs, pgd = ffffffc335ef4000
[62467.406257] [0000000000000000] *pgd=0000000000000000, *pud=0000000000000000
[62467.406405] Internal error: Oops: 96000005 [#1] PREEMPT SMP
[62467.406501] Modules linked in: ar_camera_agnostic ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bnep fuse zram overlay b53_mdio b53_common dsa_core spidev nvgpu bluedroid_pm ip_tables x_tables
[62467.408239] CPU: 1 PID: 1178 Comm: v4l2-ctl Not tainted 4.9.140tegra #32
[62467.408729] Hardware name: NVIDIA Jetson AGX Xavier 8GB Developer Kit (DT)
[62467.415917] task: ffffffc36a740e00 task.stack: ffffffc33a648000
[62467.421522] PC is at exit_creds+0x2c/0x80
[62467.425961] LR is at __put_task_struct+0x4c/0x148
[62467.430333] pc : [<ffffff80080deba4>] lr : [<ffffff80080b069c>] pstate: 60400045
[62467.437765] sp : ffffffc33a64ba10
[62467.441003] x29: ffffffc33a64ba10 x28: ffffffc36a740e00 
[62467.446514] x27: 0000000000000009 x26: ffffffc393038cc0 
[62467.452288] x25: ffffffc33a762310 x24: 0000000000000001 
[62467.457539] x23: ffffffc3b7c62048 x22: ffffffc3d9ee2b58 
[62467.462700] x21: ffffffc3ea346230 x20: 0000000000000000 
[62467.467864] x19: ffffffc3ea346200 x18: 0000000000000046 
[62467.473552] x17: 0000007f832d1f60 x16: ffffffbf0d176120 
[62467.479326] x15: 0000000000033800 x14: 0000000000000001 
[62467.484928] x13: 0000000000000551 x12: 0000000000000612 
[62467.490440] x11: 000000000000000b x10: 0000000000000a20 
[62467.496219] x9 : ffffffc33a64b850 x8 : ffffffc36a741880 
[62467.501990] x7 : fefefeff646c606d x6 : 000001c00c9e5e40 
[62467.507502] x5 : 0000000000000800 x4 : 0000000000000000 
[62467.512841] x3 : 00000000000000d8 x2 : 0000000000000000 
[62467.518177] x1 : 0000000000000000 x0 : 00000000ffffffff 

[62467.524938] Process v4l2-ctl (pid: 1178, stack limit = 0xffffffc33a648000)
[62467.531571] Call trace:
[62467.533949] [<ffffff80080deba4>] exit_creds+0x2c/0x80
[62467.538751] [<ffffff80080b069c>] __put_task_struct+0x4c/0x148
[62467.544092] [<ffffff80080dc6b0>] kthread_stop+0x1e0/0x1e8
[62467.548910] [<ffffff8008b20a18>] vi5_channel_stop_kthreads+0x40/0x58
[62467.555286] [<ffffff8008b20ac0>] vi5_channel_stop_streaming+0x90/0xb0
[62467.561158] [<ffffff8008b139a4>] tegra_channel_stop_streaming+0x34/0x48
[62467.567626] [<ffffff8008b0be04>] __vb2_queue_cancel+0x3c/0x170
[62467.573222] [<ffffff8008b0d22c>] vb2_core_queue_release+0x2c/0x58
[62467.579082] [<ffffff8008b0f840>] _vb2_fop_release+0x88/0xa8
[62467.584420] [<ffffff8008b15848>] tegra_channel_close+0x58/0x128
[62467.590109] [<ffffff8008ae989c>] v4l2_release+0x4c/0x98
[62467.595184] [<ffffff800825a8fc>] __fput+0x94/0x1d0
[62467.600000] [<ffffff800825aab0>] ____fput+0x20/0x30
[62467.604828] [<ffffff80080d9840>] task_work_run+0xc0/0xe0
[62467.610341] [<ffffff80080b98d0>] do_exit+0x2b8/0x9d8
[62467.614968] [<ffffff80080ba07c>] do_group_exit+0x3c/0xa0
[62467.620560] [<ffffff80080c76b4>] get_signal+0x29c/0x590
[62467.625735] [<ffffff800808b0c8>] do_signal+0x168/0x4e0
[62467.630974] [<ffffff800808b5b8>] do_notify_resume+0x90/0xb0
[62467.636317] [<ffffff8008083754>] work_pending+0x8/0x10
[62467.641391] ---[ end trace e71e96af6eea7fb3 ]---
[62467.654271] Fixing recursive fault but reboot is needed!

I think the TX2 no longer has this issue, because TX2 uses vi4 instead of vi5, and vi4 handles this kind of problem.

Sometimes we need the system to keep trying to capture even if no buffers are coming, or at least not to crash and require a reboot.

Do you have any ideas on how we could handle this situation?

Thanks,
-Adrian

hello ACervantes,

it’s true that TX2 and Xavier use different VI drivers; TX2 works with VI-4 and Xavier uses VI-5.

as you can see, the software recovery mechanism was triggered here.

[62376.103411] tegra194-vi5 15c10000.vi: err_rec: attempting to reset the capture channel
[62376.106113] tegra194-vi5 15c10000.vi: err_rec: successfully reset the capture channel

may I have your confirmation: did you try to disconnect the physical signal during sensor streaming?
or, could you please share the complete steps to reproduce for reference?
thanks

Hi JerryChang,

Thanks for your response,

I was able to reproduce the issue with the Xavier EVM kit and ov5693 sensor.

I modified the ov5693 driver slightly so that it does not write the stream-on register table to the sensor, so basically no buffers arrive at the Xavier CSI.

I would expect the capture attempt to fail, but V4L hangs and the system needs to be rebooted once I receive the error "fatal: error recovery failed".

these are the messages that I see:

[   38.485515] tegra194-vi5 15c10000.vi: no reply from camera processor
[   38.485685] tegra194-vi5 15c10000.vi: uncorr_err: request timed out after 2500 ms
[   38.485839] tegra194-vi5 15c10000.vi: err_rec: attempting to reset the capture channel
[   38.488524] tegra194-vi5 15c10000.vi: err_rec: successfully reset the capture channel
[   41.045510] tegra194-vi5 15c10000.vi: no reply from camera processor
[   41.045678] tegra194-vi5 15c10000.vi: uncorr_err: request timed out after 2500 ms
[   41.045860] tegra194-vi5 15c10000.vi: err_rec: attempting to reset the capture channel
[   41.046021] tegra194-vi5 15c10000.vi: unexpected response from camera processor
[   41.046191] video4linux video0: vi capture release failed
[   41.046294] tegra194-vi5 15c10000.vi: fatal: error recovery failed
[   60.861883] Unable to handle kernel NULL pointer dereference at virtual address 00000000
[   60.862111] Mem abort info:
[   60.862215]   ESR = 0x96000005
[   60.862292]   Exception class = DABT (current EL), IL = 32 bits
[   60.862410]   SET = 0, FnV = 0
[   60.862477]   EA = 0, S1PTW = 0
[   60.862557] Data abort info:
[   60.862625]   ISV = 0, ISS = 0x00000005
[   60.862702]   CM = 0, WnR = 0
[   60.862774] user pgtable: 4k pages, 39-bit VAs, pgd = ffffffc3a8045000
[   60.862995] [0000000000000000] *pgd=0000000000000000, *pud=0000000000000000
[   60.863165] Internal error: Oops: 96000005 [#1] PREEMPT SMP
[   60.863274] Modules linked in: ov5693 bnep fuse zram overlay spidev nvgpu bluedroid_pm ip_tables x_tables
[   60.863622] CPU: 0 PID: 7977 Comm: v4l2-ctl Not tainted 4.9.140-tegra #1
[   60.863741] Hardware name: Jetson-AGX (DT)
[   60.863820] task: ffffffc3a819d400 task.stack: ffffffc3c5f1c000
[   60.864199] PC is at exit_creds+0x2c/0x78
[   60.864504] LR is at __put_task_struct+0x4c/0x140
[   60.864865] pc : [<ffffff80080def7c>] lr : [<ffffff80080b01ac>] pstate: 60400045
[   60.869914] sp : ffffffc3c5f1fa10
[   60.873137] x29: ffffffc3c5f1fa10 x28: 0000000000000008 
[   60.878667] x27: ffffff8008f72000 x26: ffffffc3c5f1fde8 
[   60.884267] x25: ffffffc3e13bc9e8 x24: ffffffc3d711e518 
[   60.890107] x23: 0000000000000001 x22: ffffffc3e7906018 
[   60.895532] x21: ffffffc3e0954630 x20: 0000000000000000 
[   60.900624] x19: ffffffc3e0954600 x18: 0000000009112827 
[   60.906121] x17: 0000007f8528e698 x16: 0000000000000000 
[   60.911805] x15: 0000000000000000 x14: 0000000003369153 
[   60.917248] x13: 0000007f857f67a8 x12: 0000000000000000 
[   60.922759] x11: 00000000001192d4 x10: 0000000000000a10 
[   60.928532] x9 : ffffffc3c5f1f860 x8 : 0000000000000000 
[   60.934551] x7 : ffffff8009537000 x6 : 0000000000000001 
[   60.940063] x5 : ffffff8008163740 x4 : ffffffbf0f9e4bd0 
[   60.945153] x3 : 0000000000000001 x2 : 0000000000000000 
[   60.950737] x1 : 0000000000000000 x0 : 00000000ffffffff 

[   60.957486] Process v4l2-ctl (pid: 7977, stack limit = 0xffffffc3c5f1c000)
[   60.964128] Call trace:
[   60.966499] [<ffffff80080def7c>] exit_creds+0x2c/0x78
[   60.971310] [<ffffff80080b01ac>] __put_task_struct+0x4c/0x140
[   60.976392] [<ffffff80080dca3c>] kthread_stop+0x1e4/0x1e8
[   60.981467] [<ffffff8008b48e98>] vi5_channel_stop_kthreads+0x40/0x58
[   60.987589] [<ffffff8008b48f3c>] vi5_channel_stop_streaming+0x8c/0xa8
[   60.993973] [<ffffff8008b3b8ac>] tegra_channel_stop_streaming+0x34/0x48
[   60.999927] [<ffffff8008b33bbc>] __vb2_queue_cancel+0x34/0x188
[   61.005782] [<ffffff8008b350e4>] vb2_core_queue_release+0x2c/0x58
[   61.011644] [<ffffff8008b37764>] _vb2_fop_release+0x84/0xa0
[   61.016726] [<ffffff8008b3d224>] tegra_channel_close+0x64/0x140
[   61.022415] [<ffffff8008b10d18>] v4l2_release+0x48/0xa0
[   61.027748] [<ffffff800825ef20>] __fput+0x90/0x1d0
[   61.032559] [<ffffff800825f0d8>] ____fput+0x20/0x30
[   61.037115] [<ffffff80080d9bf4>] task_work_run+0xbc/0xd8
[   61.042887] [<ffffff80080b9674>] do_exit+0x2c4/0xa08
[   61.047785] [<ffffff80080b9e48>] do_group_exit+0x40/0xa8
[   61.053122] [<ffffff80080c7744>] get_signal+0x26c/0x578
[   61.058028] [<ffffff800808b150>] do_signal+0x130/0x500
[   61.063535] [<ffffff800808b698>] do_notify_resume+0x90/0xb0
[   61.068875] [<ffffff800808379c>] work_pending+0x8/0x10
[   61.073955] ---[ end trace 90a52d796f5f18d0 ]---
[   61.094102] Fixing recursive fault but reboot is needed!

v4l hangs:

nvidia@nvidia:~$ v4l2-ctl -d /dev/video0 --set-ctrl bypass_mode=0 --stream-mmap
^C^C^C^C^C^C^C^C^C

and these are the steps to reproduce the issue:

  1. Modify the ov5693 device driver, commenting out this line (in the start_streaming function) to avoid stream on:
    err = ov5693_write_table(priv, mode_table[OV5693_MODE_START_STREAM]);

  2. Load the modified module:
    sudo insmod ov5693.ko

  3. Run the v4l2-ctl command:
    v4l2-ctl -d /dev/video0 --set-ctrl bypass_mode=0 --stream-mmap

  4. Check dmesg after trying to capture; the fatal error will appear and the v4l2-ctl command will hang.
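
To make step 4 easier to check, here is a minimal Python sketch that summarizes the vi5 messages in a dmesg capture. The message strings are copied from the logs in this thread; the function name `summarize_vi5_log` and the `SAMPLE` list are only illustrative, and this is a log-reading aid, not a fix for the hang.

```python
import re

# Message fragments copied from the dmesg output shown in this thread.
DRIVER_TAG = "tegra194-vi5"
TIMEOUT = "uncorr_err: request timed out"
RESET_OK = "err_rec: successfully reset the capture channel"
RESET_FATAL = "fatal: error recovery failed"

# Matches "[   38.485515] message" and captures the timestamp and the message.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def summarize_vi5_log(lines):
    """Count vi5 timeouts and successful resets, and note the first fatal failure."""
    summary = {"timeouts": 0, "resets_ok": 0, "fatal_at": None}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        timestamp, message = float(match.group(1)), match.group(2)
        if DRIVER_TAG not in message:
            continue
        if TIMEOUT in message:
            summary["timeouts"] += 1
        elif RESET_OK in message:
            summary["resets_ok"] += 1
        elif RESET_FATAL in message and summary["fatal_at"] is None:
            summary["fatal_at"] = timestamp
    return summary


# A few lines taken from the log above, used as sample input.
SAMPLE = [
    "[   38.485685] tegra194-vi5 15c10000.vi: uncorr_err: request timed out after 2500 ms",
    "[   38.488524] tegra194-vi5 15c10000.vi: err_rec: successfully reset the capture channel",
    "[   41.045678] tegra194-vi5 15c10000.vi: uncorr_err: request timed out after 2500 ms",
    "[   41.046191] video4linux video0: vi capture release failed",
    "[   41.046294] tegra194-vi5 15c10000.vi: fatal: error recovery failed",
]

result = summarize_vi5_log(SAMPLE)
print(result)
```

With the sample lines, the summary reports two timeouts, one successful reset, and a fatal failure at 41.046294, which matches the pattern in the log: the first reset succeeds and the second one fails. The same function could be given the lines of a saved dmesg file instead of `SAMPLE`.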

Please let me know if you have any ideas about why the capture subsystem does not recover, fails with the fatal error, and then requires a reboot.

Thanks,
-Adrian

hello ACervantes,

since we recently fixed the error paths in the vi5_channel_start_streaming() API based on l4t-r32.4.2,
may I know which JetPack release you’re working with?
I could release a kernel patch for your testing if you’re based on the latest JetPack release,
thanks

Hi JerryChang,

We are using JP 4.4 (I think this is the latest JP you have released).

Please share the patch so we can apply it and check whether the issue is fixed.

Thank you.

hello ACervantes,

please download the attachment, June19_Topic126439.zip (3.6 KB); there are two kernel patches for your verification.
thanks

Hi JerryChang,

The patches worked very well, thank you!

-Adrian

hello ACervantes,

FYI, those kernel fixes have also been merged into the release code-line; please expect the next public release (i.e. JP-4.4 GA / l4t-r32.4.3) to include those fixes.
thanks