Kernel Crash during Camera Module Reloading with More than One Video Device

Hi,

I’m working on camera driver development on JetPack-4.5.1 and I have detected a memory access issue when removing and reloading a camera driver module that registers 2 video devices. I haven’t seen it with only one video device.

I have observed the issue with a robust camera driver that creates up to 12 video devices. However, I was able to reproduce it by implementing a simple driver (based on V4L2 Kernel Driver Version 2.0) that only creates the video device, and loading 2 instances of this driver.

The issue doesn’t appear every time. I identified the problem by using a script to remove and load the module in a loop until the issue appears. I can usually see it after a few minutes, but sometimes it takes 30 minutes or more to reproduce.
I have noticed that the issue appears faster when stressing the CPU (using stress-ng --cpu 12 --atomic 12).

This is the kernel error:

[ 1651.357405] tegra-vi4 15700000.vi: subdev camdummy 1-0052 unbind
[ 1651.365466] tegra-vi4 15700000.vi: subdev camdummy 1-0050 unbind
[ 1653.520046] camdummy 1-0050: probing v4l2 sensor
[ 1653.520141] camdummy 1-0050: tegracam sensor driver:camdummy_v2.0.6
[ 1653.520177] tegra-vi4 15700000.vi: subdev camdummy 1-0050 bound
[ 1653.541496] camdummy 1-0050: Detected CAMDUMMY sensor
[ 1653.541562] camdummy 1-0052: probing v4l2 sensor
[ 1653.541654] camdummy 1-0052: tegracam sensor driver:camdummy_v2.0.6
[ 1653.541684] tegra-vi4 15700000.vi: subdev camdummy 1-0052 bound
[ 1653.549194] Unable to handle kernel paging request at virtual address 30303735312d95
[ 1653.549598] camdummy 1-0052: Detected CAMDUMMY sensor
[ 1653.570865] Mem abort info:
[ 1653.573899]   ESR = 0x96000004
[ 1653.576971]   Exception class = DABT (current EL), IL = 32 bits
[ 1653.609166]   SET = 0, FnV = 0
[ 1653.612217]   EA = 0, S1PTW = 0
[ 1653.625221] Data abort info:
[ 1653.628099]   ISV = 0, ISS = 0x00000004
[ 1653.637195]   CM = 0, WnR = 0
[ 1653.640157] [0030303735312d95] address between user and kernel address ranges
[ 1653.653197] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[ 1653.658760] Modules linked in: cam_dummy bnep fuse zram overlay bcmdhd cfg80211 userspace_alert nvgpu bluedroid_pm ip_tables x_tables [last unloaded: cam_dummy]
[ 1653.673240] CPU: 0 PID: 24933 Comm: v4l_id Not tainted 4.9.201 #1
[ 1653.679319] Hardware name: quill (DT)
[ 1653.682973] task: ffffffc1e3d1e200 task.stack: ffffffc16b05c000
[ 1653.688885] PC is at read_phy_mode_from_dt+0x4c/0xb8
[ 1653.693839] LR is at csi4_mipi_cal+0x34/0x230
[ 1653.698187] pc : [<ffffff8008b4aa74>] lr : [<ffffff8008b4baac>] pstate: 20400045
[ 1653.705566] sp : ffffffc16b05f690
[ 1653.708871] x29: ffffffc16b05f690 x28: ffffffc1e3d1e200 
[ 1653.714194] x27: 0000000000000000 x26: 0000000000000000 
[ 1653.719517] x25: 0000000000000080 x24: ffffffc1e7552418 
[ 1653.724840] x23: ffffffc1eb650410 x22: ffffffc1e91b1028 
[ 1653.730160] x21: ffffffc1e91b1e58 x20: ffffffc1e91b1028 
[ 1653.735480] x19: 3030303735312d6d x18: 00000000000012be 
[ 1653.740800] x17: 0000000000000001 x16: 0000000000000000 
[ 1653.746122] x15: 00000000000002de x14: 0000000000004057 
[ 1653.751443] x13: 00000000014e8de5 x12: 00000000011eeadb 
[ 1653.756764] x11: 0000000000000000 x10: 0000000000000a10 
[ 1653.762084] x9 : ffffffc16b05f430 x8 : ffffffc1e3d1ec70 
[ 1653.767402] x7 : 00000000afb50401 x6 : 00000000000000bd 
[ 1653.772723] x5 : 0000000000000000 x4 : 000000000001459a 
[ 1653.778043] x3 : 00000000762e3030 x2 : 0000000000000000 
[ 1653.783364] x1 : ffffffc1eb650410 x0 : 0000000000000158 

[ 1653.790169] Process v4l_id (pid: 24933, stack limit = 0xffffffc16b05c000)
[ 1653.796943] Call trace:
[ 1653.799386] [<ffffff8008b4aa74>] read_phy_mode_from_dt+0x4c/0xb8
[ 1653.805380] [<ffffff8008b4baac>] csi4_mipi_cal+0x34/0x230
[ 1653.810767] [<ffffff8008b4ac20>] tegra_csi_mipi_calibrate+0x80/0xd0
[ 1653.817023] [<ffffff8008558594>] nvcsi_finalize_poweron+0x4c/0x98
[ 1653.823105] [<ffffff800852bef4>] nvhost_module_runtime_resume+0xbc/0x280
[ 1653.829794] [<ffffff800878becc>] pm_generic_runtime_resume+0x3c/0x58
[ 1653.836138] [<ffffff8008799d30>] __genpd_runtime_resume+0x38/0xa0
[ 1653.842220] [<ffffff800879c4a4>] genpd_runtime_resume+0xa4/0x210
[ 1653.848214] [<ffffff800878e214>] __rpm_callback+0x74/0xa0
[ 1653.853599] [<ffffff800878e274>] rpm_callback+0x34/0x98
[ 1653.858811] [<ffffff800878f710>] rpm_resume+0x470/0x710
[ 1653.864024] [<ffffff800878f9fc>] __pm_runtime_resume+0x4c/0x70
[ 1653.869843] [<ffffff800852ae2c>] nvhost_module_busy+0x5c/0x168
[ 1653.875663] [<ffffff8008b4c0c8>] csi4_power_on+0x20/0x58
[ 1653.880965] [<ffffff8008b491d8>] tegra_csi_power+0x38/0x158
[ 1653.886525] [<ffffff8008b49324>] tegra_csi_s_power+0x2c/0x38
[ 1653.892173] [<ffffff8008b3e104>] tegra_channel_set_power+0x84/0x198
[ 1653.898428] [<ffffff8008b44e58>] vi4_power_on+0x80/0xa0
[ 1653.903642] [<ffffff8008b3c118>] tegra_channel_open+0x80/0x180
[ 1653.909464] [<ffffff8008b0f9a8>] v4l2_open+0x80/0x118
[ 1653.914506] [<ffffff8008261f6c>] chrdev_open+0x94/0x198
[ 1653.919721] [<ffffff8008258918>] do_dentry_open+0x1d8/0x340
[ 1653.925282] [<ffffff8008259ed0>] vfs_open+0x58/0x88
[ 1653.930149] [<ffffff800826d3b0>] do_last+0x530/0xfd0
[ 1653.935102] [<ffffff800826dee0>] path_openat+0x90/0x378
[ 1653.940316] [<ffffff800826f450>] do_filp_open+0x70/0xe8
[ 1653.945531] [<ffffff800825a394>] do_sys_open+0x174/0x258
[ 1653.950831] [<ffffff800825a4fc>] SyS_openat+0x3c/0x50
[ 1653.955875] [<ffffff800808395c>] __sys_trace_return+0x0/0x4
[ 1653.961436] ---[ end trace 94da04ded20bebe6 ]---

Have you seen this problem?
Do you know if there’s a fix for this bug?

Thanks,
-Enrique

Did you verify with the reference sensor ov5693?

Hi @ShaneCCC

The Jetson TX2 devkit supports a single ov5693 camera. As this problem appears only when registering 2 or more video devices, it’s not possible to reproduce the issue with this sensor.

Do you have any idea about what could be the problem?

-Enrique

Please provide detailed steps to reproduce so I can check with a multiple-sensor board.

@ShaneCCC

You just need to remove and re-load the module several times.

I wrote a script that performs the module reload loop; you just need to change the module name:
modules_reload_loop.sh (225 Bytes)
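
In essence it is just a loop along these lines (a minimal sketch, not the exact attached script; the module name, sudo usage, and delay are assumptions you should adjust to your setup):

#!/bin/bash
# Minimal sketch of a module reload loop (adjust MODULE and the delay to your setup)
MODULE=cam_dummy
while true; do
    sudo rmmod "$MODULE"
    sleep 1
    sudo modprobe "$MODULE"
    sleep 1
done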

I also needed to stress the system (at least for the dummy driver setup) with this command while reloading the modules:
stress-ng --cpu 12 --atomic 12

During the module reloads, I monitor the kernel messages until the issue appears:
dmesg -w

The issue can take some time to appear.

Let me know if you need more information.

Thanks

Is there any reason for reloading the module continuously?

That’s an endurance test. If this is a race condition it could appear at any time, so we need to make sure this problem will not occur.

Any news on this topic? Have you tried to reproduce it?

With the simple (dummy) driver I don’t see it very frequently, which is why I’m stressing the system. However, with the robust driver that we are developing I see the issue very often. Sometimes it appears on the first reload (without running the stress-ng command).

We were able to see the issue and are going to figure out a solution.

Great! Please keep me updated on any finding.

Thanks!

Developer resources are currently tight, so progress may be slow.
Do you see the issue often during boot, rather than under stress?

I have never seen the issue after boot; it only appears after removing and re-loading the sensor module.

With our robust driver, the issue appears without using the stress command, but in that case we have real cameras, 12 streams, and higher power consumption.
Using the dummy driver, I have only reproduced the issue when using the stress command.

Could you run a long reboot test to confirm? If rebooting works without problems, I think we can give this issue low priority.

@ShaneCCC

I can confirm that this issue doesn’t appear at boot time. However, I disagree that this is a low-priority issue: Loadable Kernel Modules (LKM) are a feature that you support, and it is broken because re-loading can fail at any time.

-Enrique

I have already reported this to the developer, but currently we can’t get resources to investigate and figure out the root cause.
Also, if a normal boot works without problems, I think this should be a low-risk issue.

It looks like this issue is fixed by the new release JetPack 4.6.1 (r32.7.1).
