nvcamera-daemon fails to recover and repeats last frames

While developing our camera application on the TX2, I have noticed several bugs. But as it goes, without a way to reproduce them nothing gets solved.

Here is one: once broken, the ISP repeats the last few frames forever.

I have traced the issue down to nvcamera-daemon not doing any recovery. It can be triggered with this script:

verbose-nvcamd.sh

#!/bin/bash
datetime=$(date +%Y%m%d-%H%M%S)
log_file=/home/nvidia/nvcamd-$datetime.log
# stop the running nvcamera-daemon
killall nvcamera-daemon
# enable verbose PCL and SCF logging
export enableCamPclLogs=1
export enableCamScfLogs=1
/usr/sbin/nvcamera-daemon 2>&1 | tee "$log_file"

Run this in one shell, while running a GStreamer pipeline with nvoverlaysink in another. Because a lot of output is produced, press SCROLL-LOCK to examine the messages. After a few tens of seconds the live image on the overlay stops - likely the terminal buffer fills up and nvcamera-daemon is put into a not-running state. Pressing SCROLL-LOCK again to release the blocking lets the program run again, but it only repeats the last frames. This happens with high likelihood; I have observed only about 1 case in 10 where the stream continues from the live camera source.
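For reference, the second shell runs roughly this pipeline (just an example; the caps are an assumption, adjust resolution and framerate to your sensor mode):

gst-launch-1.0 nvcamerasrc ! 'video/x-raw(memory:NVMM), width=1920, height=1080, format=I420, framerate=30/1' ! nvoverlaysink -e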

The normal use case is to not enable verbose messages, but we observed the repeated-frames issue before we even knew that this userspace daemon handles the processing. The consequence is that a high application load causing a scheduling imbalance will leave the camera in a non-working state. Sometimes it was enough to move windows around the Ubuntu desktop UI to break the camera operation.

A recovery method needs to be put into nvcamera-daemon, because doing real-time tasks from regular userspace is not possible on the TX2.

This was tested with a single 12 MP / 30 fps camera from Leopard Imaging.

@danieel
Can this problem be reproduced on the reference camera board (E3326, OV5693)?
Could you provide detailed steps if the issue reproduces on the E3326 too?

Hi danieel,

Any update, or has this issue been resolved?

Thanks

No, it was not resolved. And you should be able to verify it yourself.

On my TX2, I put back the kernel/DTB so that the OV5693 is detected and works through qv4l2, but nvcamera-daemon crashes when I attempt to run nvgstcapture:

I added strace to the daemon invocation; its output ends with:

Thread 1 getting next capture
Thread 1 is waiting
Thread 2 getting next capture
Thread 2 is waiting
Thread 3 getting next capture
Thread 3 is waiting
Thread 4 getting next capture
Thread 4 is waiting
Thread 5 getting next capture
Thread 5 is waiting
Thread 6 getting next capture
Thread 6 is waiting
Thread 7 getting next capture
Thread 7 is waiting
Thread 8 getting next capture
Thread 8 is waiting
Thread 9 getting next capture
Thread 9 is waiting
Thread 10 getting next capture
Thread 10 is waiting
Thread 11 getting next capture
Thread 11 is waiting
Thread 12 getting next capture
Thread 12 is waiting
Starting services...
Worker thread IspHw statsComplete start
Worker thread IspHw frameComplete start
Worker thread CaptureScheduler checkFramePending start
Worker thread CaptureScheduler frameStart start
Worker thread V4L2CaptureScheduler checkCaptureComplete start
Worker thread V4L2CaptureScheduler issueCaptures start
 <unfinished ...>
+++ killed by SIGSEGV (core dumped) +++
./verbose-nvcamd.sh: line 8:  3027 Segmentation fault      (core dumped) strace /usr/sbin/nvcamera-daemon

With the crash point being in… wait for it… your amazing CUDA!

[  436.595679] nvcamera-daemon[3040]: unhandled level 2 translation fault (11) at 0x00000028, esr 0x92000006
[  436.605267] pgd = ffffffc18658b000
[  436.608680] [00000028] *pgd=0000000261d1e003, *pud=0000000261d1e003, *pmd=0000000000000000

[  436.618479] CPU: 3 PID: 3040 Comm: nvcamera-daemon Not tainted 4.4.38-tegra #1
[  436.625706] Hardware name: quill (DT)
[  436.629379] task: ffffffc055502580 ti: ffffffc1eaa94000 task.ti: ffffffc1eaa94000
[  436.636870] PC is at 0x7f9a7ba628
[  436.640195] LR is at 0x7f9a7ba618
[  436.643502] pc : [<0000007f9a7ba628>] lr : [<0000007f9a7ba618>] pstate: 60000000
[  436.650904] sp : 0000007f99faa400
[  436.654240] x29: 0000007f99faf9d0 x28: 0000000000000000 
[  436.659589] x27: 000000000000043f x26: 0000007f946daff0 
[  436.664949] x25: 0000007f947c5aa0 x24: 0000000000200000 
[  436.670294] x23: 0000000fc1800000 x22: 0000007f947c5ac0 
[  436.675637] x21: 0000007f947c5ac0 x20: 0000007f947c5820 
[  436.680996] x19: 0000000000000000 x18: 000000000000000f 
[  436.686339] x17: 0000007f9b821fb0 x16: 0000007f9ae24348 
[  436.691694] x15: 0000007f9c216000 x14: 0000000000000000 
[  436.697039] x13: 0000000000000000 x12: 0000000000000000 
[  436.702378] x11: 0000000000000000 x10: 0000000000000000 
[  436.707736] x9 : 0000007f94975040 x8 : 0000000000000000 
[  436.713086] x7 : 0000000000000000 x6 : 0000000000000000 
[  436.718433] x5 : 0000000000000be0 x4 : 0000007f946dcc80 
[  436.723782] x3 : 0000007f99fb01e0 x2 : 0000000000000000 
[  436.729126] x1 : 0000000000000000 x0 : 0000000000000000 

[  436.735979] Library at 0x7f9a7ba628: 0x7f9a54a000 /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1.1
[  436.744858] Library at 0x7f9a7ba618: 0x7f9a54a000 /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1.1
[  436.753731] vdso base = 0x7f9c215000

And here, you know what - you are the ones who have to solve it.
We have no source code access to nvcamera-daemon, so we are unable to trace the issue further.

But let's focus on the original issue, since that is the major one, affecting the reliability of an end product.

@danieel
Please give us detailed reproduction steps on the OV5693. And I believe it's R28.1.

I did not want to break my TX2, so I took a TX1 instead; it should really not matter.

Reproduction:

  1. Get a Jetson TX1 (or TX2) and a HDMI monitor
  2. Flash L4T R28.1 using JetPack in order to get a clean starting point
  3. Open 2 terminals through SSH to the target board
  4. Create the above verbose-nvcamd.sh script and set chmod +x for it
  5. Terminal 1: sudo ./verbose-nvcamd.sh
  6. Terminal 2: nvgstcapture
  7. There should be an image from camera on the screen
  8. In terminal 1 press SCROLL-LOCK
  9. After a while the image freezes
  10. Wait a few minutes
  11. Press SCROLL-LOCK again
  12. Image is no longer working (occasionally it does not break; that happened when I did not wait long enough)

The nvcamera-daemon will then output this forever:

Issue bubble captures
NvPclSettingsUpdate: Sending Updated Settings through PCL
NvPclSettingsApply: Applying last settings through PCL
apply:+++++++++++++++++++++++
apply:-----------------------
NvPclSettingsApply: Reading PCL settings

The complete log is attached (I used SCROLL-LOCK twice to finally trigger the issue).
nvcamd-20180213-121253.zip (257 KB)

How long does it take to reproduce? Is step 8 a must? What if you don't enable the logs via export enableCamPclLogs=1 and export enableCamScfLogs=1?

ShaneCCC - the combination of enabling the logs and using SCROLL-LOCK ensures that the process is no longer scheduled as running, but blocks on a write() call.

Maybe you can also disturb the scheduler by putting a high load on the system or by sending signals - I do not know. The above sequence is how I can reproduce the issue.
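If signals alone are enough, something along these lines should behave like the SCROLL-LOCK case (an untested sketch; I have only ever reproduced it with SCROLL-LOCK):

kill -STOP $(pidof nvcamera-daemon)   # suspend the daemon mid-stream
sleep 120                             # leave it stopped for a couple of minutes
kill -CONT $(pidof nvcamera-daemon)   # resume it and check whether live frames come back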

Clearly, your process expects real-time privileges that it does not have. Anything that prevents it from performing the desired ISP handling tasks in time will break the camera operation.
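For what it is worth, the scheduling policy the daemon actually runs with can be checked like this (assuming chrt from util-linux is installed; a real-time process would report SCHED_FIFO or SCHED_RR):

chrt -p $(pidof nvcamera-daemon)   # prints the daemon's current scheduling policy and priority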

Another, slightly related issue: when there are signal integrity problems and the MIPI data is corrupt, a missing SOF or EOF causes weird timeouts in the camera daemon or out-of-memory messages, a pile of logs in dmesg, and halted operation of ALL sensors. The process often crashes, and when it does not, it is again unable to recover from that state. It is anything but robust.

Could you confirm this problem can be reproduced on the reference camera board (OV5693)? I did not get the issue following your steps.

ShaneCCC - yes, the above guide is for an out-of-box TX1/TX2 with the OV5693 reference camera.

This likely applies to all current uses of SCF, with any sensor (direct MIPI to ISP processing), since you do not yet support images sourced from memory.

Is it easy to reproduce? How much time does it take to reproduce?

Yes, it is easy to reproduce; see the guide above for how to break the functionality.

@danieel
Could you try argus_camera to clarify?

@ShaneCCC - I have compiled Argus as Leopard Imaging suggests for use with their IMX377 camera. I am getting many lockups, and it seems that the imaging subsystem is not able to recover from any error. Otherwise argus_camera would be a nice demo tool.

The cameras do crash randomly: when playing with the settings, or when I enable multi-session, recording of video does not work; after some settings we get 1 fps and never get back to 30, after others we get only 20 fps. Sometimes the argus_camera GUI goes black and white - a non-responding application. Switching inputs causes errors like:

nvidia@tegra-ubuntu:~$ DISPLAY=:0 argus_camera -d 1

Executing Argus Sample Application (argus_camera)
Argus Version: 0.96.2 (multi-process)
(Argus) Error EndOfFile: Unexpected error in reading socket (in src/rpc/socket/client/ClientSocketManager.cpp, function recvThreadCore(), line 212)
(Argus) Error EndOfFile: Receiving thread terminated with error (in src/rpc/socket/client/ClientSocketManager.cpp, function recvThreadWrapper(), line 315)

And a lot of seemingly unhelpful data appears in dmesg (attached).

It seems that this Argus tool uses the same SCF approach as nvcamera-daemon, only that it runs in the process itself (or in some so far unidentified place - did you hide your black box in another black box??). I am unable to provide any helpful messages to debug this other than - IT JUST DOES NOT WORK. A few seconds of a single camera is not enough for us to develop anything on your Tegra platform.

After certain crashes, we are unable to get any image in the Argus window - so yes, the bugginess of the platform is now double-confirmed.
argus.dmesg.txt (111 KB)

If the instability is only with the IMX377, you may need Leopard's help.

@ShaneCCC - Leopard is unable to solve your closed-source SCF and Argus issues, and as a scaling partner of NVIDIA they turned out to be totally useless and incompetent for this task.

The only thing Leopard can be blamed for is that their kernel driver calls take too long, due to the muxed I2C architecture of the hardware when using 3 sensors, and Sony's insanely long sensor initialization blobs. There is practically no way out of this - communication with the hardware is sometimes slow, you know.

As I have pointed out, the ISP handling code of SCF expects things to happen immediately, which is not how real-world systems work. The process is unable to recover when one of its syscalls is blocked (SCROLL-LOCK on a printf), and it behaves completely unstably when the V4L2 calls take too long. If it does not crash due to the 250 ms call that initializes a new sensor, it crashes due to the intensive calls made while adjusting AE gain/exposure times on multiple sensors.
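A rough way to see how long those calls actually take is to attach strace with per-syscall timing (a sketch only; requires root, must be attached while the stream is running, and the daemon's ioctls are not all decoded by name):

strace -f -T -e trace=ioctl -p $(pidof nvcamera-daemon)   # -T prints the time spent in each syscall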

We know why the camera system is crashing.
We have a reproduction guide for stock devkit hardware.
Yet NVIDIA is unable to fix this issue?

We have spent many hours figuring out what is wrong with your black-box solution, without you providing any ISP documentation or SCF code. We pointed to the cause as precisely as we could, but you will still blame Leopard, Leopard can do nothing about the issue, and it is never going to be fixed.

@danieel
As I said in an earlier comment, we could not reproduce your problem with the reference sensor board, so we don't know what could be wrong with Argus and SCF.
I will have QA try to reproduce it again.

Hi danieel,

Following your steps, I can't reproduce your issue on TX2.
Could you try the issue with the latest JetPack 3.2?

JetPack-3.2: https://developer.nvidia.com/embedded/dlc/jetpack-l4t-3_2-ga