Video Encode Error

We’re trying to track down a video encode problem on TX1 with the latest JetPack 4.4.1 release: after running for a long time, the video encoder starts throwing strange errors. We’re able to reproduce the issue with the NVIDIA samples, namely the 01_video_encode demo from the Multimedia API samples. Our procedure, roughly sketched below, is:

  1. Spawn 8 video encoder processes with 1080p resolution, all high profile
  2. Wait for all to finish
  3. Restart all if no errors
  4. Quit if any of the video encoders fail
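
In script form, the loop looks roughly like this (a simplified sketch; the actual script we use is linked below, and its failure check is slightly more involved):

        while true; do
                pids=()
                for i in $(seq 8); do
                        # 8 parallel 1080p H264 encodes, all high profile
                        ./video_encode test.raw 1920 1080 H264 out$i.264 -p high > enc$i.log 2>&1 &
                        pids+=($!)
                done
                for pid in "${pids[@]}"; do
                        # assumes video_encode exits non-zero when the encoder hits an error
                        wait "$pid" || { echo "encoder failed"; exit 1; }
                done
        done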

Here is the log from the video_encode application when the error occurs:

===== NVMEDIA: NVENC =====
NvMMLiteBlockCreate : Block : BlockType = 4 
Creating Encoder in blocking mode 
Opening in BLOCKING MODE 
875967048
842091865
H264: Profile = 100, Level = 51 
NvRmChannelSubmit: NvError_IoctlFailed with error code 22
NvRmPrivFlush: NvRmChannelSubmit failed (err = 196623, SyncPointIdx = 4, SyncPointValue = 0)
fence_set_name ioctl failed with 22
NvMMLiteVideoEncDoWork:NvMSEncConvertSurfaceFormat failed
VENC: NvMMLiteVideoEncDoWork: 4283: BlockSide error 0x4
NvVideoEnc: BlockError 
NvVideoEncTransferOutputBufferToBlock: DoWork failed line# 667 
NvVideoEnc: NvVideoEncTransferOutputBufferToBlock TransferBufferToBlock failed Line=678
[ERROR] (NvV4l2ElementPlane.cpp:178) <enc0> Capture Plane:Error while DQing buffer: Invalid argument
Error while dequeing buffer from output plane
NvVideoEncTransferCaptureBufferToBlock: DoWork failed line# 631 
NvVideoEncTransferCaptureBufferToBlock: DoWork failed line# 631 
NvVideoEncTransferCaptureBufferToBlock: DoWork failed line# 631 
[ERROR] (NvV4l2ElementPlane.cpp:178) <enc0> Output Plane:Error while DQing buffer: Invalid argument
ERROR while DQing buffer at output plane
Encoder is in error

We’ve created a simple bash script demonstrating the issue; the source code is available at [1]. Our JetPack version is:

# R32 (release), REVISION: 4.4, GCID: 23942405, BOARD: t210ref, EABI: aarch64, DATE: Fri Oct 16 19:44:43 UTC 2020

We are running the video_encode application with the following command:

./video_encode test.raw 1920 1080 H264 out.264 -p high

where test.raw is random data generated with the following command:

dd if=/dev/urandom of=test.raw bs=3110400 count=30

The size is arbitrary; the data is random to maximize entropy and stress the encoder. Please note that this issue predates JetPack 4.4: we were seeing similar problems with JetPack 3.3 as well.

Thanks,
Caglar


[1] https://gitlab.com/sparsetech/jetson-video-encode-stress-test

Hi,
Thanks for reporting the issue. We will try to reproduce it and investigate.

Hi,
One question, do you hit the issue if you generate test.raw with

gst-launch-1.0 videotestsrc num-buffers=60 ! video/x-raw, width=1920,height=1080 ! filesink location=test.raw

Update:
We have tried the script and somehow the following logic hangs the test:

        while [[ "$cnt" != "0" ]]
        do
               echo waiting
               sleep 1
               cnt=$(ps -A | grep -i test.sh | wc -l)
        done

We changed it to

        sleep 5

We don’t observe the issue after looping 2000+ times.

Yes, it happens even if I generate the test video with videotestsrc, though somewhat later. Let me give you some more numbers about my test cases. Let me call one pass of all N encoders over the input video an iteration. Then:

  1. Generate a full random video, use 8 encoders, average failure iteration is ~300
  2. Generate a semi-random video with videotestsrc, average failure iteration is ~600
  3. Generate a full random video, use 1 encoder, no failure (more than 40k iterations)
  4. Generate a full random video with gstreamer(*), use 8 encoders, average failure iteration is ~300

We have several devices in the field, on the order of hundreds, and we’re randomly facing similar encoder crashes on several of them. I believe the issue is related to the stress level on the devices.

Your 5-second trick did not work for me; I still get the crashes with a 5-second sleep.

Another footnote: I monitor the thermal footprint of my devices with ‘sudo tegrastats’, and the issue is more likely to happen when the temperature (the ‘thermal@53.25C’ part of the output) is above 60C.
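
For example, I capture just that reading with something like the following (assuming GNU grep; the ‘thermal@…’ tag is as tegrastats prints it on my release):

sudo tegrastats | grep -o --line-buffered 'thermal@[0-9.]*C' >> thermal.log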

Thanks,
Caglar

(*) Using videotestsrc pattern=snow
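
For completeness, the pipeline for (*) is the one you posted with the snow pattern added, roughly (buffer count chosen arbitrarily):

gst-launch-1.0 videotestsrc pattern=snow num-buffers=60 ! video/x-raw, width=1920,height=1080 ! filesink location=test.raw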

Hi,
We cannot reproduce it on the TX1 developer kit with JetPack 4.4.1. It has run 2000+ times and no issue is seen. By default the TX1 module ships with a fan; do you keep it, or do you remove it and use another thermal solution?

Hi,

Please note that one of my test devices failed after 27254 iterations. These tests are running on TX1s with carrier boards, but I believe the problem is more or less related to the TX1 modules themselves.

About the thermal situation: I’m running some devices under active cooling, which only delays the crashes. If you have any suggestion for a thermal metric to watch in tegrastats, I can log it during our tests.

Thanks,
Caglar

Hi,
For information, do you observe it on the TX1 developer kit? We have not replicated it so far. We would like to know whether you have seen the issue on the developer kit, or only on your carrier board.

Hi,

I found my good old TX1 developer kit and the problem is present on it as well. I can set up a remote (SSH) connection to this machine in case you need it.

Do you have any software change suggestions for mitigating this issue? We can implement software guards so that we do not run into it. At the moment we’re trying to stick with the sample code as much as possible; the kind of guard we have in mind is sketched below.
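
To illustrate, a rough sketch of such a guard (run_encoder is a hypothetical wrapper name, and the error detection would need to match the messages in the log above):

        run_encoder() {
                local idx=$1
                while true; do
                        ./video_encode test.raw 1920 1080 H264 out$idx.264 -p high > enc$idx.log 2>&1
                        # restart on a non-zero exit or on the "Encoder is in error" message
                        if [ $? -eq 0 ] && ! grep -q "Encoder is in error" enc$idx.log; then
                                break
                        fi
                        echo "encoder $idx failed, restarting"
                done
        }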

Thanks,
Caglar

Hi,
We will try to replicate it on the TX1 developer kit.

UPDATE: The test has passed 13289 times in the loop and is still ongoing. We are stopping it now and will set up a test on Friday to run over the weekend.

Hi Dane,

Some suggestions:

  1. Maybe you can increase the number of encoders; I sometimes use 12 encoders to make some of my devices fail (see the one-liner after this list). I have devices failing in the field after 2 months of uninterrupted operation.

  2. Make sure that you’re using the high profile; I’ve almost never faced this issue with the baseline profile.

  3. Using multiple devices helps as well; some devices fail more frequently than others.
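
For item 1, the heavier variant is simply more parallel instances of the same command (same file names as in the command shown earlier):

        for i in $(seq 12); do
                ./video_encode test.raw 1920 1080 H264 out$i.264 -p high &
        done
        wait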

Thanks,
Caglar

Hi caglarakyuz,

After running over the weekend, the test has passed 81787+ times and is still ongoing without issue.

Hi all,

I have been working on the same issue since the beginning of this year. My JetPack version was 3.3 at the start, using a different TX1 developer kit and other carrier boards provided by your partners; at that time I only tested my own project, which is built on your Multimedia API samples.

I had a mini test setup containing multiple TX1s to monitor this issue. After a while, when JetPack 4.4 was released, I upgraded half of the test setup to JetPack 4.4.

To simplify my monitoring procedure, I started working directly with your samples, in case my own project had some other issues. However, I encountered the exact problem @caglarakyuz has. Both JetPack versions (3.3 and 4.4) have the same problem.

With 8 encoders (fan on or off), I see the same problem after 1600+ iterations, without running any additional processes of my own. With 4 encoders it takes more iterations, but it still happens. Moreover, my projects also use other components such as the decoder, which puts even more stress on the system.

To see how the system behaves in general when other processes are running, as mentioned above, I generated some CPU and GPU load with the CUDA samples and the Multimedia API decoder sample provided by NVIDIA. The failure then happens earlier than before.

From my debug points and the comments, I see that the problem might occur in any of these APIs: v4l2_ioctl (libv4l2 / NVIDIA headers inside the Multimedia API), and dqBuffer/qBuffer (NvV4l2ElementPlane).

Thanks,

Kemal.