Potential issues with Falcon leading to issues in vi5_fops

Hi,

my company has a carrier board that connects to 8 GMSL cameras via a MAX9295-MAX9296 SerDes pipeline, and we are encountering seemingly random kernel panics. I was originally using a 5.1.2 kernel, and after finding and trying out the patches below I eventually moved to a 5.1.3 kernel, which has proved to be more stable.

https://nv-tegra.nvidia.com/r/gitweb?p=linux-nvidia.git;a=commitdiff;h=3294188e501a5f1fc6b7aadcc6cbff8ac8dcaf59

https://nv-tegra.nvidia.com/r/gitweb?p=linux-nvidia.git;a=commitdiff;h=d8749afd649c213927bb49c1384f063f099dd083

The kernel panics are hard to narrow down, but they always seem to stem from somewhere in csi5/vi5_fops. An example panic:

[  479.780563] Unable to handle kernel paging request at virtual address fffef249efbe51b0
[  479.788738] Mem abort info:
[  479.791613]   ESR = 0x96000044
[  479.794782]   EC = 0x25: DABT (current EL), IL = 32 bits
[  479.800286]   SET = 0, FnV = 0
[  479.803431]   EA = 0, S1PTW = 0
[  479.806703] Data abort info:
[  479.809693]   ISV = 0, ISS = 0x00000044
[  479.813666]   CM = 0, WnR = 1
[  479.816721] [fffef249efbe51b0] address between user and kernel address ranges
[  479.824092] Internal error: Oops: 96000044 [#1] PREEMPT SMP
[  479.829819] Modules linked in: fuse xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrags
[  479.829916]  nct1008 ina3221 typec max96712 tegra_bpmp_thermal snd_soc_spdif_tx tegra210_adma snd_soc_rl6231 fzcam snd_hda_core userspace_alert fzcama]
[  479.942957] CPU: 3 PID: 16162 Comm: vi-output, DS5  Not tainted 5.10.192a1 #20
[  479.950387] Hardware name: Jetson AGX Orin Developer Kit (DT)
[  479.956295] pstate: 80c00009 (Nzcv daif +PAN +UAO -TCO BTYPE=--)
[  479.962476] pc : queued_spin_lock_slowpath+0x214/0x3c0
[  479.967759] lr : queued_spin_lock_slowpath+0x20/0x3c0
[  479.972950] sp : ffff8000329db940
[  479.976360] x29: ffff8000329db940 x28: 0000000000000000
[  479.981824] x27: ffff8000304eb998 x26: ffff1aaa71d90880
[  479.987284] x25: ffff1aaa6f232530 x24: ffff1aab3cb6a400
[  479.992745] x23: ffff1aaa6f232538 x22: ffff8000329db9d8
[  479.998210] x21: ffff1aaa6e1a2c40 x20: 000000000fee65d4
[  480.003673] x19: ffff1aaa6f232538 x18: 0000000000000010
[  480.009133] x17: 0000000000000000 x16: ffffd798c8d250f0
[  480.014592] x15: ffff1aaa6e1a31b0 x14: 653331203569762d
[  480.020051] x13: 3439316172676574 x12: 3069763a78317473
[  480.025512] x11: 6f682e3030303034 x10: 0000000000000000
[  480.030979] x9 : 0000000000100000 x8 : 0000000000000000
[  480.032157] tegra194-vi5 13e40000.host1x:vi0@15c00000: capture control message timed out
[  480.036438] x7 : ffff1ab16cf31b00 x6 : ffff1aaa6e1a2c40
[  480.036440] x5 : ffffd798ca747b00 x4 : ffff1ab16cf31b00
[  480.036443] x3 : ffff1ab12549d690 x2 : ffffd798ca747b20
[  480.036445] x1 : ffffd798ca747b20 x0 : ffff1ab16cf31b08
[  480.036449] Call trace:
[  480.036456]  queued_spin_lock_slowpath+0x214/0x3c0
[  480.036467]  _raw_spin_lock+0x8c/0xa0
[  480.036473]  __mutex_unlock_slowpath.isra.0+0x154/0x1b0
[  480.044838] Unable to handle kernel paging request at virtual address ffffdc7aca747b30
[  480.050251]  mutex_unlock+0x6c/0x80
[  480.050260]  vi_capture_ivc_send_control.isra.0+0xf4/0x130
[  480.050262]  vi_capture_control_send_message+0x7c/0x230
[  480.050264]  vi_capture_control_message+0xa0/0x140
[  480.050267]  csi5_send_control_message+0x54/0xb0
[  480.050268]  csi5_stream_open+0x98/0x110
[  480.050271]  csi5_start_streaming+0xac/0xe0
[  480.050273]  csi5_error_recover+0x44/0x90
[  480.050281]  tegra_csi_error_recovery+0x58/0xc0
[  480.050292]  tegra_csi_sync_event+0x44/0x50
[  480.055758] Mem abort info:
[  480.061206]  vi5_channel_error_recover+0x1ac/0x200
[  480.061209]  tegra_channel_error_recover+0x78/0xa0
[  480.061211]  tegra_channel_kthread_capture_dequeue+0x130/0x1f0
[  480.061215]  kthread+0x148/0x170
[  480.061218]  ret_from_fork+0x10/0x18
[  480.061228] Code: 910020e0 8b0200a2 f861d863 aa0203e1 (f8236827)
[  480.066687]   ESR = 0x96000044
[  480.069209] ---[ end trace 6ec6b86511b39b1e ]---
[  480.074148]   EC = 0x25: DABT (current EL), IL = 32 bits
[  480.083671] Kernel panic - not syncing: Oops: Fatal exception
[  480.091480]   SET = 0, FnV = 0
[  480.095061] SMP: stopping secondary CPUs
[  480.095079] Kernel Offset: 0x5798b8d00000 from 0xffff800010000000
[  480.095080] PHYS_OFFSET: 0xffffe556c0000000
[  480.095083] CPU features: 0x08040006,4a80aa38
[  480.095084] Memory Limit: none
[  480.219431] ---[ end Kernel panic - not syncing: Oops: Fatal exception ]---

But it is common for the faulting memory access to change from panic to panic when I resolve the instruction addresses, which makes me believe some sort of memory corruption is happening.

While I can't definitively say it is the cause, when I am testing (usually switching streams) there is always a lot of corr_err spam shortly before a panic, so I suspect the error recovery/handling code cannot keep up with errors at the rate we are experiencing them.

Digging into the errors further, I am now getting messages such as

tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 0, flags: 0, err_data 0x200, status: 14

which, according to the status code, come from the Falcon. I see 3 kinds of errors but cannot get any more information out of the err_data. Ignoring the most common error in vi5_capture_dequeue seemed to increase stability, but the system is still very prone to panicking.

I came across this thread, which also seems to fit: the panics started when the firmware of our Intel cameras decreased the frame period, i.e. increased the data rate, which is why I believe the skew error is a likely culprit. My problem is that I boot the Jetson from a USB disk that does not contain a firmware partition, and I have had nothing but painful experiences with the NVIDIA flash scripts, so I would very much appreciate the updated Falcon firmware in standalone form that I can drop into my rootfs to test.

On a side note, we also very commonly see the following errors

[ 3215.206790] tegra-camrtc-capture-vi tegra-capture-vi: uncorr_err: flags 2, err_data 0
[ 3217.839769] tegra-camrtc-capture-vi tegra-capture-vi: uncorr_err: request timed out after 2500 ms
[ 3217.849012] tegra-camrtc-capture-vi tegra-capture-vi: err_rec: attempting to reset the capture channel
[ 3217.863781] t194-nvcsi 13e40000.host1x:nvcsi@15a00000: csi5_stream_close: Error in closing stream_id=4, csi_port=4
[ 3217.875830] tegra-camrtc-capture-vi tegra-capture-vi: err_rec: successfully reset the capture channel
[ 3255.076773] tegra-camrtc-capture-vi tegra-capture-vi: uncorr_err: flags 2, err_data 0
[ 3257.518888] tegra-camrtc-capture-vi tegra-capture-vi: uncorr_err: request timed out after 2500 ms
[ 3257.528099] tegra-camrtc-capture-vi tegra-capture-vi: err_rec: attempting to reset the capture channel
[ 3257.538683] t194-nvcsi 13e40000.host1x:nvcsi@15a00000: csi5_stream_close: Error in closing stream_id=4, csi_port=4
[ 3257.551782] tegra-camrtc-capture-vi tegra-capture-vi: err_rec: successfully reset the capture channel
[ 3257.774994] tegra-camrtc-capture-vi tegra-capture-vi: uncorr_err: request timed out after 2500 ms
[ 3257.784164] tegra-camrtc-capture-vi tegra-capture-vi: err_rec: attempting to reset the capture channel
[ 3257.794198] t194-nvcsi 13e40000.host1x:nvcsi@15a00000: csi5_stream_close: Error in closing stream_id=2, csi_port=2
[ 3257.805406] t194-nvcsi 13e40000.host1x:nvcsi@15a00000: csi5_stream_open: VI channel not found for stream- 2 vc- 0
[ 3257.816808] tegra-camrtc-capture-vi tegra-capture-vi: err_rec: successfully reset the capture channel

in case they give more context or are related.

Cheers,

Alex

Apply the patch from the topic below for the kernel panic.

Thanks for the quick reply @ShaneCCC!

It would seem I had already integrated 0001-capture-ivc-fix-multi-cam-race-condition.patch.

0001-kernel-camera-remove-csi-enent-sync-for-recover.patch seems to have brought some stability, but I am still experiencing some bad image tearing and a spam of the errors from my previous post.

I am now having issues with streaming, and I believe it is due to a conflict with the patch I took from here, which also modifies vi5_channel_stop_streaming.

I am using a 5.1.3 kernel at the moment and was wondering what your recommendation would be: should I revert to the 5.1.2 vi5_channel_stop_streaming + 0001-vi5_fops-fix-mem-leak.patch, or is there a way to merge the two?

My current function looks like this after my best-educated guess at merging the patch into my code base, but it does not seem to work. I applied this patch last, and reverting it brings back my streams, albeit in a very poor state.

static int vi5_channel_stop_streaming(struct vb2_queue *vq)
{
    struct tegra_channel *chan = vb2_get_drv_priv(vq);
    int vi_port = 0;

    if (!chan->bypass)
        vi5_channel_stop_kthreads(chan);

    /* csi stream/sensor(s) devices to be closed before vi channel */
    tegra_channel_set_stream(chan, false);

    if (!chan->bypass) {
        for (vi_port = 0; vi_port < chan->valid_ports; vi_port++) {
            /* Release capture requests */
            if (chan->request[vi_port] != NULL) {
                dma_free_coherent(chan->tegra_vi_channel[vi_port]->rtcpu_dev,
                    chan->capture_queue_depth * sizeof(struct capture_descriptor),
                    chan->request[vi_port], chan->request_iova[vi_port]);
            }
            chan->request[vi_port] = NULL;

            /* Release emd data buffers */
            if (chan->emb_buf_size > 0) {
                struct device *vi_unit_dev;

                vi5_unit_get_device_handle(chan->vi->ndev, chan->port[0],
                    &vi_unit_dev);
                dma_free_coherent(vi_unit_dev, chan->emb_buf_size,
                    chan->emb_buf_addr, chan->emb_buf);
                chan->emb_buf_size = 0;
            }

            filp_close(chan->fp[vi_port], NULL);
            chan->tegra_vi_channel[vi_port] = NULL;
            kfree(chan->tegra_vi_channel[vi_port]);
        }

        /* release all remaining buffers to v4l2 */
        tegra_channel_queued_buf_done(chan, VB2_BUF_STATE_ERROR, false);
    }

    return 0;
}

Your input on the merge and how to proceed would be greatly appreciated.

Cheers,

Alex

I don't think the image tearing is related to the patches.
Those patches only fix the kernel panic.
You can check the trace log for more information.

Thanks

Yes, I agree, could it be because of the Falcon firmware deskew?

Do you have any input for how vi5_channel_stop_streaming should look in 5.1.3?

Cheers

I don't think it is the deskew either. If it were a deskew problem, you should receive frames, but with image tearing.

But I do receive frames, sometimes with quite bad tearing

Please check the trace log for any errors.

Which trace specifically? I do not have much experience with the NVIDIA platforms

Get it with the commands below.

echo 1 > /sys/kernel/debug/tracing/tracing_on
echo 30720 > /sys/kernel/debug/tracing/buffer_size_kb
echo 1 > /sys/kernel/debug/tracing/events/tegra_rtcpu/enable
echo 1 > /sys/kernel/debug/tracing/events/freertos/enable
echo 3 > /sys/kernel/debug/camrtc/log-level
echo 1 > /sys/kernel/debug/tracing/events/camera_common/enable
echo > /sys/kernel/debug/tracing/trace

v4l2-ctl --stream-mmap -c bypass_mode=0

cat /sys/kernel/debug/tracing/trace
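
If it helps, you can redirect that last command to a file to attach here, for example (the file name is just an example):

cat /sys/kernel/debug/tracing/trace > trace_vi.txt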

Thanks, will try it out and let you know

trace_1.txt (9.5 MB)
trace_2.txt (60.9 MB)

Here are two traces and an image of the stream output in our web interface for the robot.

Let me know if the information you’re after is in them.

Cheers

Try boosting the clocks.
If it is still the same, it could be that the output size is not as expected.

sudo su
echo 1 > /sys/kernel/debug/bpmp/debug/clk/vi/mrq_rate_locked
echo 1 > /sys/kernel/debug/bpmp/debug/clk/isp/mrq_rate_locked
echo 1 > /sys/kernel/debug/bpmp/debug/clk/nvcsi/mrq_rate_locked
echo 1 > /sys/kernel/debug/bpmp/debug/clk/emc/mrq_rate_locked
cat /sys/kernel/debug/bpmp/debug/clk/vi/max_rate |tee /sys/kernel/debug/bpmp/debug/clk/vi/rate
cat /sys/kernel/debug/bpmp/debug/clk/isp/max_rate | tee  /sys/kernel/debug/bpmp/debug/clk/isp/rate
cat /sys/kernel/debug/bpmp/debug/clk/nvcsi/max_rate | tee /sys/kernel/debug/bpmp/debug/clk/nvcsi/rate
cat /sys/kernel/debug/bpmp/debug/clk/emc/max_rate | tee /sys/kernel/debug/bpmp/debug/clk/emc/rate

So far it seems good; my output was

root@cart999jetson:/home/cartken# cat /sys/kernel/debug/bpmp/debug/clk/vi/max_rate |tee /sys/kernel/debug/bpmp/debug/clk/vi/rate
832000000
root@cart999jetson:/home/cartken# cat /sys/kernel/debug/bpmp/debug/clk/isp/max_rate | tee /sys/kernel/debug/bpmp/debug/clk/isp/rate
1011200000
root@cart999jetson:/home/cartken# cat /sys/kernel/debug/bpmp/debug/clk/nvcsi/max_rate | tee /sys/kernel/debug/bpmp/debug/clk/nvcsi/rate
642900000
root@cart999jetson:/home/cartken# cat /sys/kernel/debug/bpmp/debug/clk/emc/max_rate | tee /sys/kernel/debug/bpmp/debug/clk/emc/rate
3199000000

Sometimes the tearing can be hard to reproduce, so I will keep an eye on it to see if it reappears.

Are there parameters in the device tree that can lock these clocks?

You can adjust the pix_clk_hz to request a higher clock instead of boosting to the max clocks.
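If you just want the debugfs clock locks from above to persist, one minimal sketch (same debugfs paths as above, run as root at boot, e.g. from rc.local) would be:

for clk in vi isp nvcsi emc; do
    echo 1 > /sys/kernel/debug/bpmp/debug/clk/$clk/mrq_rate_locked
    cat /sys/kernel/debug/bpmp/debug/clk/$clk/max_rate > /sys/kernel/debug/bpmp/debug/clk/$clk/rate
done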

BTW, you need to be aware of the deskew note:

Skew calibration is required if the sensor or deserializer is using DPHY and the output data rate is > 1.5 Gbps.
An initiation deskew signal should be sent by the sensor or deserializer to perform the skew calibration. If the deskew signal is not sent, the receiver will stall and the capture will time out.
You can calculate the output data rate with the following equation:

Output data rate = (sensor or deserializer pixel clock in hertz) * (bits per pixel) / (number of CSI lanes)
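
For example, with hypothetical numbers (not taken from this setup): a 187.5 MHz deserializer pixel clock, RAW12 (12 bits per pixel) and 2 CSI lanes give

echo $(( 187500000 * 12 / 2 ))
1125000000

i.e. about 1.125 Gbps per lane, which is below the 1.5 Gbps threshold, so deskew calibration would not be required.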

Is the skew calibration something that is handled by the tegra vi stack?

It's handled by the camera firmware.

Thanks

Hi @ShaneCCC

the patches seem to have helped a lot; I no longer get any panics, but sometimes my cameras get stuck in some sort of corr_err spam loop that brings my system to its knees, making it very hard for me to get into the web UI that I use to view the different streams.

The error spam looks like

[ 439.893624] tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 0, flags: 0, err_data 131072
[ 439.895950] tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 0, flags: 0, err_data 131072
[ 439.896138] tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 0, flags: 0, err_data 131072
[ 439.960132] tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 0, flags: 0, err_data 131072
[ 439.960303] tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 0, flags: 0, err_data 131072
[ 439.962615] tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 0, flags: 0, err_data 131072
[ 439.962813] tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 0, flags: 0, err_data 131072
[ 440.026807] tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 0, flags: 0, err_data 131072
[ 440.026969] tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 0, flags: 0, err_data 131072
[ 440.029293] tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 0, flags: 0, err_data 131072
[ 440.029483] tegra-camrtc-capture-vi tegra-capture-vi: corr_err: discarding frame 0, flags: 0, err_data 131072

Is there any information I can provide to give more insight into this error, or do you have any idea what is causing it? It is also quite random when it happens: the system can get stuck in such a loop for a few minutes and then pull itself out of it. Sometimes I am able to view the streams, albeit usually only briefly, before the system becomes quite unresponsive.

Looking forward to your response.

Cheers,

Alex

Check the trace log for more information.

Thanks

Here is a trace taken while running

v4l2-ctl --stream-mmap -c bypass_mode=0

and one taken while running our web UI that displays the video streams (this is normally where the errors are encountered, as it switches between streams).

rtcpu_cartken_webui.txt (6.7 MB)
rtcpu_v4l2_stream_mmap.txt (1.3 MB)

Thanks!