nvargus-daemon freeze/hang on pipeline stop on R32.1

I need a patch for 32.1 since the driver isn't supported on 32.2 (I will double-check). Can you send me one?

NVIDIA has a bug. It smells like a timing one.

I need two pipelines to work, not one. AFAICT, after speaking with the GStreamer team, this is entirely an issue with nvarguscamerasrc/nvargus-daemon flushing buffers on pipeline shutdown. Under no circumstances should set_state(Gst.State.NULL) cause a hang. And the fact that nvargus spews a bunch of errors and/or crashes is not the right behavior, regardless of what client scripts do.
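To be concrete about the stop path I mean, here is a minimal sketch (the pipeline string is a placeholder, not my real recording pipeline; my actual scripts use two pipelines and record to files):

# Minimal stop-path sketch: PLAYING for a few seconds, then NULL.
import time
import gi
gi.require_version('Gst', '1.0')
from gi.repository import Gst

Gst.init(None)
pipeline = Gst.parse_launch('nvarguscamerasrc ! fakesink')  # placeholder pipeline
pipeline.set_state(Gst.State.PLAYING)
time.sleep(5)                        # capture for a few seconds
pipeline.set_state(Gst.State.NULL)   # this call should never hang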

I really, really need a patch for this. Can you help me?

Hi,
The modification is significant and we have concerns about offering a patch, since it has more potential to harm stability.
[url]https://devtalk.nvidia.com/default/topic/1051362/jetson-tx2/bug-in-nvarguscamerasrc-with-gstrtspserver/post/5350039/#5350039[/url]
We can try to have further contact and cooperation on it.

Alright, but this is a show stopper. I literally have no way to make my application work unless you can offer me a workaround.

Is there a beta patch I could at least try just to confirm that the r32.2 patch really fixes it?

EDIT: I will confirm with the vendor (Leopard Imaging) if they have a driver available for r32.2 for my sensor as well.

I sent some email via the contact page as per the other thread.

The vendor won't have a driver until the end of the year at the earliest. We are officially stuck unless you can facilitate some kind of patch or workaround for this bug.

Would polling the GStreamer bus work instead of relying on asynchronous messaging via the main event loop? I only suggest this since gst-launch-1.0 actually polls (but again, all of that is in C, which also complicates things if this is indeed timing related).
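This is the kind of polling loop I have in mind (a sketch, not my actual code; the pipeline string is a placeholder):

# Sketch: poll the bus directly, like gst-launch-1.0 does, instead of
# attaching an async watch to a GLib main loop.
import gi
gi.require_version('Gst', '1.0')
from gi.repository import Gst

Gst.init(None)
pipeline = Gst.parse_launch('nvarguscamerasrc ! fakesink')
bus = pipeline.get_bus()
pipeline.set_state(Gst.State.PLAYING)

while True:
    # Block for up to 100 ms waiting for EOS or ERROR on the bus.
    msg = bus.timed_pop_filtered(
        100 * Gst.MSECOND, Gst.MessageType.EOS | Gst.MessageType.ERROR)
    if msg is not None:
        break

pipeline.set_state(Gst.State.NULL)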

Hi,
We are checking it. Since the change is significant, it will take some time to verify.

Thanks! Please keep me posted.

Can I please get an update?

Hi,

We are verifying it.

I’d like an ETA on the fix and I want to verify that you are using my script (or something equivalent) to verify the patch. Thanks.

@DaneLLL: Bump?

Hi,
Please try the attachment.
Thorough tests were performed on r32.2. We still strongly suggest users upgrade to that version.
r32_1_TEST_libgstnvarguscamerasrc.zip (22.7 KB)

I’d love to but our vendor doesn’t support r32.2 yet.

The library hangs and segfaults when setting the pipeline to PLAYING. It also seems to hang when one tries to get the current pipeline state (get_state()), etc.

Can you please test my script with your patch on R32.1?

Hi,
We have verified the script. One difference is that the resolution was modified to 2592x1944 since we don't have camera boards supporting 4K. Please check the md5sum:

$ md5sum libgstnvarguscamerasrc.so
676024de317084d2e11fefb5d7d92e0a  libgstnvarguscamerasrc.so

The camera's resolution should have nothing to do with this. This is simply a matter of nvarguscamerasrc tracking the GStreamer bus/pipeline state correctly.

If you restart an existing pipeline that has been stopped, it just crashes/hangs. So set the pipeline's state to PLAYING, sleep a few seconds to record some frames, then set it to either NULL or PAUSED/READY after catching an EOS event, wait a few seconds to settle, then put it in the PLAYING state again. Boom! (See the sketch after the note below.)

NOTE: After this happens, nvargus-daemon also freezes and needs to be restarted to even initialize a new pipeline again.
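Roughly, the sequence is (a sketch with a placeholder pipeline; my real script records to a file):

# Repro sketch: PLAYING -> EOS -> NULL -> PLAYING again on the same pipeline.
import time
import gi
gi.require_version('Gst', '1.0')
from gi.repository import Gst

Gst.init(None)
pipeline = Gst.parse_launch('nvarguscamerasrc ! fakesink')
bus = pipeline.get_bus()

pipeline.set_state(Gst.State.PLAYING)
time.sleep(5)                                   # record some frames

pipeline.send_event(Gst.Event.new_eos())        # ask the pipeline to finish
bus.timed_pop_filtered(Gst.CLOCK_TIME_NONE, Gst.MessageType.EOS)
pipeline.set_state(Gst.State.NULL)
time.sleep(2)                                   # let things settle

pipeline.set_state(Gst.State.PLAYING)           # hang/crash happens around here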

I'm checking whether recreating the pipeline from scratch on every restart is a viable workaround. But you still have some serious bugs here.

If you want to reproduce this, modify my initial test script to flip the pipeline state back and forth a few times. nvargus-daemon and the application will hang/crash almost immediately.

Again, my application is a recorder program that the user can stop at any moment and restart at will. This worked fine on R28.2.1.

Recreating the pipeline from scratch every time seems to be a viable workaround. But there seems to be no way to restart an existing pipeline after an EOS event is caught and the state changes to READY or NULL.
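For reference, the workaround I'm testing looks roughly like this (a sketch; the launch string stands in for my real recording pipeline):

# Workaround sketch: never reuse a stopped pipeline; build a new one per session.
import time
import gi
gi.require_version('Gst', '1.0')
from gi.repository import Gst

Gst.init(None)
LAUNCH = 'nvarguscamerasrc ! fakesink'   # placeholder for the real recording pipeline

def record_once(seconds):
    pipeline = Gst.parse_launch(LAUNCH)
    bus = pipeline.get_bus()
    pipeline.set_state(Gst.State.PLAYING)
    time.sleep(seconds)
    pipeline.send_event(Gst.Event.new_eos())
    bus.timed_pop_filtered(Gst.CLOCK_TIME_NONE, Gst.MessageType.EOS)
    pipeline.set_state(Gst.State.NULL)   # drop it; the next session gets a fresh pipeline

record_once(5)
record_once(5)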

Hi,
Please share another script so that we can simply apply it to run the usecase.

https://drive.google.com/file/d/1IWqQOeVqt-sifhWBjZnCBK1bjePJtKev/view?usp=sharing

The script starts two recording sessions, waits a few seconds, sends EOS, catches the EOS and sets the pipeline to the NULL state, then waits a second, restarts recording by setting the pipeline back to PLAYING, then quits.

When the pipeline is set to PLAYING for the second time, I get an immediate EOS event sent to the bus. Why? Also, everything after that starts to fall apart (nvargus-daemon just hangs).

NEXT:

Occasionally I'm seeing nvargus-daemon just SEGV/hang when:

  1. Two simultaneous connections are made to the daemon
  2. The user has an error in their pipeline syntax (why does nvargus-daemon just crash if the client can't instantiate a pipeline correctly? That's really bad behavior).

(Argus) Error EndOfFile: Unexpected error in reading socket (in src/rpc/socket/client/ClientSocketManager.cpp, function recvThreadCore(), line 266)
(Argus) Error EndOfFile: Receive worker failure, notifying 1 waiting threads (in src/rpc/socket/client/ClientSocketManager.cpp, function recvThreadCore(), line 340)
(Argus) Error InvalidState: Argus client is exiting with 1 outstanding client threads (in src/rpc/socket/client/ClientSocketManager.cpp, function recvThreadCore(), line 357)
(Argus) Error EndOfFile: Receiving thread terminated with error (in src/rpc/socket/client/ClientSocketManager.cpp, function recvThreadWrapper(), line 368)
(Argus) Error EndOfFile: Client thread received an error from socket (in src/rpc/socket/client/ClientSocketManager.cpp, function send(), line 145)
(Argus) Error EndOfFile: (propagating from src/rpc/socket/client/SocketClientDispatch.cpp, function dispatch(), line 87)
WARNING Argus: 10 client objects still exist during shutdown:
548218175816 (0x7f6800dff8)
548218176168 (0x7f60008ef8)
548221039440 (0x7f68001770)
548238187840 (0x7f60000c80)
548238188000 (0x7f680017f0)
548238188208 (0x7f60000d00)
548238193328 (0x7f68001930)
548238193680 (0x7f60000e20)
548238197616 (0x7f600026a0)
548238206896 (0x7f6800df20)

This also occurs if you shut down an application while a stream is running. If these are just warnings (stale sockets), then say so; otherwise this is very confusing.
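For reference, the kind of clean teardown I'd expect to avoid those messages on a normal exit looks roughly like this (a sketch; the pipeline string is a placeholder):

# Sketch: on SIGINT, drain with EOS and set NULL before the process exits,
# so the Argus client side detaches before shutdown.
import signal
import sys
import gi
gi.require_version('Gst', '1.0')
from gi.repository import Gst

Gst.init(None)
pipeline = Gst.parse_launch('nvarguscamerasrc ! fakesink')  # placeholder pipeline
bus = pipeline.get_bus()

def shutdown(signum, frame):
    pipeline.send_event(Gst.Event.new_eos())
    bus.timed_pop_filtered(5 * Gst.SECOND, Gst.MessageType.EOS)
    pipeline.set_state(Gst.State.NULL)
    sys.exit(0)

signal.signal(signal.SIGINT, shutdown)
pipeline.set_state(Gst.State.PLAYING)
signal.pause()   # run until Ctrl-C triggers the clean shutdown above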