TX1 NvVideoDecoder causing hang/reboot on L4T 32.7.4

Hi,

I am updating my device from L4T 28.3.2 (its previous release) to 32.7.4, as required by a customer who needs the newer Linux kernel. During testing the customer observed that the unit reboots spontaneously every so often. The device receives H.264 streams over RTP or RTSP, decodes them, composites them, and displays them on the HDMI output.

I was able to reproduce what I think is the cause and narrowed it down to the NvVideoDecoder getting stuck somewhere during many iterations of deleting and recreating the decoder. My usage of NvVideoDecoder is modeled after the 00_video_decode sample. Most of the time the error happens very quickly after I queue an EOS buffer to signal the decoder's capture thread to stop. I went through all the similar threads on this forum and tested many different things, including updating the r8152 driver to a newer version (which did not change the behavior), inserting some sleeps to check for races (which do change the behavior), and shutting the decoders down sequentially rather than concurrently (which I am testing now).
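
For reference, the EOS signalling I mean is the same pattern as the sample: queueing an output-plane buffer with bytesused set to 0. A trimmed-down sketch (not my exact code; names like `signal_decoder_eos` and `buffer_index` are just illustrative):

```cpp
#include <cstdint>
#include <cstring>
#include <iostream>
#include <linux/videodev2.h>

#include "NvVideoDecoder.h"

// Trimmed-down sketch of the EOS signalling described above, following the
// 00_video_decode pattern: an output-plane buffer with bytesused == 0 tells
// the decoder to flush, after which the capture thread drains the remaining
// frames and exits. "dec" and "buffer_index" are illustrative.
static int signal_decoder_eos(NvVideoDecoder *dec, uint32_t buffer_index)
{
    struct v4l2_buffer v4l2_buf;
    struct v4l2_plane planes[MAX_PLANES];

    memset(&v4l2_buf, 0, sizeof(v4l2_buf));
    memset(planes, 0, sizeof(planes));

    v4l2_buf.index = buffer_index;       // a buffer already dequeued from the output plane
    v4l2_buf.m.planes = planes;
    v4l2_buf.m.planes[0].bytesused = 0;  // zero bytes used marks end of stream

    int ret = dec->output_plane.qBuffer(v4l2_buf, NULL);
    if (ret < 0)
        std::cerr << "Failed to queue EOS buffer on the output plane" << std::endl;
    return ret;
}
```

In my failing runs the hang shows up almost immediately after this qBuffer call.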

The initial error I found was that a syncpoint owned by NVDEC was timing out, which would then cause the PMIC watchdog (which we have enabled on this device) to reboot the system very quickly. I disabled the PMIC watchdog and found that the Linux kernel will eventually detect the hung thread, print a stack trace, and reboot. One of the stack traces showed that the hung thread was NVMDecBufProcT. These runs take anywhere from 2 to 5 hours to crash, so deciding on each next debugging step is tedious.

I have attached three stack traces along with some logging from our application to show the decoder thread and capture thread activity. One trace shows the r8152 driver hanging, one shows NVMDecBufProcT hanging, and one shows the tegradc IRQ task hanging. One also shows the syncpoint debug info printed when the kernel driver detects the syncpoint timeout. The traces are somewhat interspersed with my application's output.

I am aware the TX1 is now unsupported, but I’d also like to update our Nano-based product to use 32.7.4 and I don’t want to try that until I have figured this out. We are also planning to upgrade our products to use the Orin NX but I haven’t started porting our software yet as I need to complete this task first.

Thank you for your time,

Chris Richardson
RebootingCodecRestarts_2023-10-17_NVIDIA_StackTrace1.txt (10.3 KB)
RebootingCodecRestarts_2023-10-17_NVIDIA_StackTrace2.txt (5.5 KB)
RebootingCodecRestarts_2023-10-17_NVIDIA_StackTrace3.txt (14.2 KB)

Hi,
To check the issue, we need to reproduce it on a developer kit. Please try to reproduce it by running either sample (default or with a patch):

/usr/src/jetson_multimedia_api/samples/00_video_decode/
/usr/src/jetson_multimedia_api/samples/unittest_samples/decoder_unit_sample/

And share the patch and steps with us.

Hi DaneLLL,

Thanks a lot for your response. I am working on reproducing this with the 00_video_decode program. I will post it here once I have it working.

For my own real-world use case, I did find a workaround that seems to make the issue go away. Normally I shut down all the decoders concurrently; I changed to shutting them down consecutively, and the device then runs for much longer periods (e.g. 24 hours rather than 2-3). I haven't run a longer test than that yet because I don't have the time. I would prefer not to rely on this workaround in my release software if possible, though.
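
To make the difference concrete, the two shutdown orders only differ in where the join happens. A simplified sketch (not my exact code; `shutdown_decoder` is a stand-in for my per-decoder EOS + wait-for-capture-thread + delete sequence, and it assumes the capture plane uses the NvV4l2ElementPlane DQ-thread callback model):

```cpp
#include <thread>
#include <vector>

#include "NvVideoDecoder.h"

// Stand-in for the per-decoder teardown: EOS is assumed to have already been
// queued on the output plane; wait for the capture/DQ thread to drain, then
// destroy the decoder. With a hand-rolled capture thread you would
// pthread_join it here instead of calling waitForDQThread.
static void shutdown_decoder(NvVideoDecoder *dec)
{
    dec->capture_plane.waitForDQThread(2000);  // timeout in ms
    delete dec;
}

// Original behaviour: tear all decoders down concurrently. This is the
// ordering that hangs after a few hours in my application.
static void shutdown_concurrent(std::vector<NvVideoDecoder *> &decoders)
{
    std::vector<std::thread> workers;
    for (NvVideoDecoder *dec : decoders)
        workers.emplace_back(shutdown_decoder, dec);
    for (std::thread &t : workers)
        t.join();
}

// Workaround: tear the decoders down one at a time. With this ordering the
// unit has survived 24+ hours so far.
static void shutdown_sequential(std::vector<NvVideoDecoder *> &decoders)
{
    for (NvVideoDecoder *dec : decoders)
        shutdown_decoder(dec);
}
```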

Thanks,

Chris Richardson

Hi,
This sounds similar to
Jetson h264 decoder flush deadlock

It should be fixed in JetPack 4.6.4: V4L2_BUF_FLAG_LAST is now set when signaling EoS. I am not sure if you have code to check the flag in your application:
Jetson h264 decoder flush deadlock - #20 by DaneLLL
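
The check in the patched capture loop looks roughly like the sketch below (simplified, not the exact sample code): when dqBuffer on the capture plane fails with EAGAIN, the buffer flags are tested for V4L2_BUF_FLAG_LAST to detect end of stream.

```cpp
#include <cerrno>
#include <cstring>
#include <linux/videodev2.h>
#include <unistd.h>

#include "NvVideoDecoder.h"

// Rough sketch of the EoS check in the patched capture loop (not the exact
// sample code). Returns true once the decoder signals that the last frame
// has been delivered.
static bool capture_plane_got_eos(NvVideoDecoder *dec)
{
    struct v4l2_buffer v4l2_buf;
    struct v4l2_plane planes[MAX_PLANES];
    NvBuffer *dec_buffer = NULL;

    memset(&v4l2_buf, 0, sizeof(v4l2_buf));
    memset(planes, 0, sizeof(planes));
    v4l2_buf.m.planes = planes;

    if (dec->capture_plane.dqBuffer(v4l2_buf, &dec_buffer, NULL, 0) < 0)
    {
        if (errno == EAGAIN && (v4l2_buf.flags & V4L2_BUF_FLAG_LAST))
            return true;      // decoder has flushed its last frame
        usleep(1000);         // nothing ready yet (or a transient error)
        return false;
    }

    // Got a decoded frame; a real capture loop would render dec_buffer and
    // queue it back to the capture plane here before trying again.
    return false;
}
```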

If you have followed the sample code and the issue is still present, please share a method to reproduce it with either reference sample so that we can replicate it and check.

Hi DaneLLL,

Thank you for the info. I did see those threads, and I do have the EOS handling changes that appear in the patch at #20 that you linked. The changes were already present in the sample program in the L4T 32.7.4 release I'm using. My application also has those changes now and, under normal circumstances, does receive the EOS frame in the capture thread via the EAGAIN handling. Unfortunately it does not seem to make a difference.

The decoder hang I was seeing appears to be lower level than a hung dqBuffer call (which my application detects separately and does not see here). The hang caused everything on the system to halt, which triggered the PMIC watchdog, since even my very-high-priority watchdog thread was not able to run. The hang almost always occurs immediately after my main decoder thread queues an EOS buffer onto the output plane while the application is shutting all the decoders down; the capture thread doesn't even get a chance to run and show that its dqBuffer call returned.
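
For context, the separate dqBuffer hang detection is essentially a heartbeat timestamp that the capture thread updates and a watchdog thread checks. A simplified illustration (the names and the 5-second threshold are made up for this post, not my exact code):

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>

// Simplified illustration of heartbeat-based stall detection: the capture
// thread stamps a timestamp around each dqBuffer attempt and a watchdog
// thread checks that the stamp keeps advancing. Names and the 5-second
// threshold are illustrative only.
static std::atomic<int64_t> last_dq_heartbeat_ms{0};

static int64_t now_ms()
{
    using namespace std::chrono;
    return duration_cast<milliseconds>(
        steady_clock::now().time_since_epoch()).count();
}

// Called by the capture thread before/after each dqBuffer attempt.
static void capture_thread_heartbeat()
{
    last_dq_heartbeat_ms.store(now_ms());
}

// Runs in a (high-priority) watchdog thread. In the hang described above,
// even this thread never gets scheduled, so the PMIC watchdog fires instead.
static void watchdog_loop()
{
    for (;;)
    {
        std::this_thread::sleep_for(std::chrono::seconds(1));
        int64_t silence_ms = now_ms() - last_dq_heartbeat_ms.load();
        if (silence_ms > 5000)
            std::fprintf(stderr, "capture thread stalled for %lld ms\n",
                         (long long)silence_ms);
    }
}
```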

Thanks,

Chris Richardson

Hi DaneLLL,

Sorry, I forgot to ask one more question: can I use the libtegrav4l2 library from that link on L4T 32.7.4? If so, I can try it out to see if it makes a difference.

Thanks,

Chris Richardson

Hi,
r32.7.4 already has the fix, so the issue you are seeing may be a different one. Please try to reproduce it with either reference sample and share the method with us so that we can replicate it and check.
