Video Decoding Errors

Hi

I am writing a video player with smooth fast-forward / reverse playback support, running on a Jetson AGX Xavier. It relies on the V4L2 video decoding device and NvBufSurface management.

In order to have smooth forward and backward playback, I use a large number of surfaces, enough to store at least 2 decoded GOPs. Each surface has a single buffer (batchSize = 1). The capture plane uses DMABUF memory-type buffers.

I am facing a lot of undocumented errors during decoding. At some point it always ends like this:

NVDEC_COMMON: NvDecGetSurfPinHandle : Surface not registered 
NVDEC_COMMON: NvDecGetSurfPinHandle : Surface not registered 
...
NVDEC_COMMON: NvDecGetSurfPinHandle : Surface not registered 
NVDEC_COMMON: Surface not registered 
decoderPushSurfFalconMethodRelocShift failed for Luma iteration i = 0 
NVDEC_COMMON: NvDecGetSurfPinHandle : Surface not registered 
NVDEC_COMMON: Surface not registered 
decoderPushSurfFalconMethodRelocShift failed for Chroma iteration i = 0 
NVDEC_COMMON: NvDecGetSurfPinHandle : Surface not registered 
NVDEC_COMMON: Surface not registered 
decoderPushSurfFalconMethodRelocShift failed for Luma iteration i = 1 
NVDEC_COMMON: NvDecGetSurfPinHandle : Surface not registered 
NVDEC_COMMON: Surface not registered 
decoderPushSurfFalconMethodRelocShift failed for Chroma iteration i = 1 
NVDEC_COMMON: NvDecGetSurfPinHandle : Surface not registered 
NVDEC_COMMON: Surface not registered 
...
decoderPushSurfFalconMethodRelocShift failed for Luma Last frame
NVDEC_COMMON: NvDecGetSurfPinHandle : Surface not registered 
NVDEC_COMMON: Surface not registered 
decoderPushSurfFalconMethodRelocShift failed for Chroma Last frame 
NvRmHost1xStreamEndClass: Requested number of operations in StreamBegin not yet pushed
NvRmHost1xStreamEnd: StreamEndClass failed, err = 8
NVDEC_COMMON: Stream end failed
tegraH264DecoderDecode: Call to pushNVDECStreamEnd failed 
TEGRA_NVDEC_H264: Stream flush failed err = 8
NvMediaVideoDecoderRenderPriv: failed to decode the picture
NVMEDIA: cbDecodePicture: 1647: NVMEDIAVideoDecoderRender failed!

Yet all the file descriptors provided to the capture plane refer to a properly allocated and valid surface.

What can be the root cause of these errors, and how can I get more information on the issue? Is there an environment variable, like DBG_NVBUFSURFTRANSFORM, that provides debug information?

I am using Jetson Linux SDK 35.4.1, but I have the same issue on all tested versions, from 35.x to 36.2.

Regards
Thomas

May I add:

In my current implementation, each surface has a single buffer (batchSize = 1), as in the various samples provided by NVIDIA. However, if I modify my code to use a single surface with multiple buffers (say batchSize = 200), I no longer get any decoding errors.

But only buffer 0 of the surface then has valid content (a properly decoded image). All other buffers of the surface have invalid content: if I dump them, every byte is 0x00.

Regards
Thomas

Hi,
We would suggest developing your use case based on the sample:

/usr/src/jetson_multimedia_api/samples/00_video_decode/

The latest releases for Xavier are JetPack 4.6.4 (NvBuffer APIs) and 5.1.2 (NvBufSurface APIs). If you use a previous version, please consider upgrading to the latest one.

From the error, it looks like the buffers are not allocated through the NvBufSurface (or NvBuffer) APIs, or are not correctly queued to the capture plane. We would suggest trying the default 00_video_decode sample as a reference app.

Hi DaneLLL

Thanks for taking time to answer.

Unfortunately, using the high-level classes like NvVideoDecoder or NvBuffer, as in the 00_video_decode sample, isn't an option in my case.

In order to have smooth playback, my player fully decodes a GOP, plus a second one (before or after, depending on the play direction). The player then estimates which frame to present inside the GOP according to the play speed and direction. If I want to support video files with a maximum GOP size of 120, I need 240 surfaces (batchSize = 1), plus 16 more for the decoded picture buffer.

This is something I already implemented in the past using ffmpeg and VA-API and it works fine. Now I am porting it to the Nvidia API.

When requesting DMABUF buffers on the capture plane, the decoder will only grant up to 64 buffers. This isn't an issue: when I queue one of the 64 v4l2 buffers on the capture plane, I assign it one of the 256 surface file descriptors (1 surface ↔ 1 fd, since batchSize = 1).

Depending on the video file I play back, the NVDEC_COMMON: NvDecGetSurfPinHandle : Surface not registered errors occur either when dequeuing the very first frames, or after a hundred or so dequeued capture buffers.

Right now I am using a custom Yocto Linux distribution, based on meta-tegra and JetPack 5.1.2. I also tried different versions, all with the same issue.

Unfortunately, so far I have failed to reproduce this issue with the 00_video_decode sample, as my code is very different.

I confirm that all DMABUF buffers queued to the capture plane use the file descriptor of an NvBufSurface. I know the surface is valid because, when the error happens, I can still successfully call NvBufSurfaceFromFd with the file descriptor that was queued to the capture plane (otherwise I would get a dmabuf_fd mapped entry NOT found error).

Some differences from the 00_video_decode sample:

  • the V4L2 device is opened in non blocking mode
  • I have a single thread handling both the output and the capture plane

I know that as long as I cannot reproduce this issue with the sample app, it will be difficult to debug. Right now I'm just trying to figure out the possible causes of the NVDEC_COMMON: NvDecGetSurfPinHandle : Surface not registered error. Are there any undocumented limitations regarding NvBufSurface usage with the decoder? Any known bugs?

My issue seems similar to JPEG encoding reports “Surface not registered” after hundreds of images on Jetpack 5.x

Regards
Thomas

Here is some additional information, which leads me to think it's a bug in either the NvBufSurface API or the V4L2 video decoder:

As I said earlier, I have a single thread handling both the output and capture planes, and the V4L2 video decoder device is opened in non-blocking mode. I have a GOP decode function which basically does:

forever {
    read a complete encoded frame from stream
    queue encoded frame to output plane
    try to dequeue already processed encoded frame from output plane
    queue DMABUF to capture plane
    try to dequeue decoded frame from capture plane
}

If I add a delay at the end of the loop, I don’t have any errors, and the decoding process works well.

All the code is sequential, running in a single thread, and surfaces are not shared (yet) with any other thread.

When the decoder is instantiated, the NVIDIA API creates additional threads (cuda-EvtHandlr and V4L2_DecThread). Could there be some timing oddities leading to these errors?

I can confirm that the lower the delay in the decoding loop, the higher the error occurrence.

It looks like if the NVIDIA-created threads (cuda-EvtHandlr and V4L2_DecThread) aren't scheduled quickly enough, I run into these issues.

Hi,
Do you use Jetpack 5.1.2 or 4.6.4?

Hi

I am using Jetpack 5.1.2.

Hi,
As we have discussed in
Decoder only fills first buffer of NvBufSurface
The issue should be due to your customization. Not sure about your use case, but if you need more buffers for decoded frame data, please extend this value:

 ctx->numCapBuffers = min_dec_capture_buffers + ctx->extra_cap_plane_buffer;

Hi

I’ve done some progress and managed to reproduce the issue in a sample application. Please see the sample code in the attached archive decode_error.tar.gz.

This is a sample decode application which handles the output and capture planes in a single thread. You can specify the number of capture plane buffers and the number of surfaces (to replicate my application). The idea is to allocate a pool of NvBufSurfaces larger than the number of requested capture plane buffers. The sample app then picks available surfaces and queues them on the capture plane.

There is no display support, but you can dump the result to a YUV file and visualize it with YUVviewer (frames are converted to NV12 before being dumped to file).

Here is the syntax:

Usage: ./decode_error -i input [-o output] [-c capture] [-s surface] [-d delay] [-y] [-h]

options:
    -i, --input <file>        video file to playback
    -o, --output <file>       save decoded images to file
    -n, --num-output <num>    number of output buffers (default 32)
    -c, --num-capture <num>   number of capture buffers (default 32)
    -s, --num-surface <num>   number of surfaces to allocate (default 128)
    -d, --delay <delay>       delay to introduce in decoding loop (in microseconds)
                              default is 20us
    -y, --yield               call sched_yield between each operation
    -h, --help                this help message

For example :

./decode_error -i bbb_sunflower_1080p_30fps_normal_10s.m2ts -d 0 -n 32 -c 32 -s 64

will work, whereas

./decode_error -i bbb_sunflower_1080p_30fps_normal_10s.m2ts -d 0 -n 32 -c 32 -s 128

will trigger errors.

As long as the number of surfaces is at most 64 (the maximum number of requested capture plane buffers), everything works fine. As soon as the number of surfaces is strictly greater than 64, we start getting errors like:

reference in DPB was never decoded
...
NVDEC_COMMON: NvDecGetSurfPinHandle : Surface not registered 
...
decoderPushSurfFalconMethodRelocShift failed for Luma iteration i = 0 
NVDEC_COMMON: NvDecGetSurfPinHandle : Surface not registered 
NVDEC_COMMON: Surface not registered 
decoderPushSurfFalconMethodRelocShift failed for Chroma iteration i = 0 
...
NvRmHost1xStreamEndClass: Requested number of operations in StreamBegin not yet pushed
NvRmHost1xStreamEnd: StreamEndClass failed, err = 8
NVDEC_COMMON: Stream end failed
tegraH264DecoderDecode: Call to pushNVDECStreamEnd failed 
TEGRA_NVDEC_H264: Stream flush failed err = 8
NVMMLITE_NVVIDEODEC, <cbDecodePicture, 3919> ErrorInfo = VideoErrorInfo_NvVideoDecoderDecode cctx = 0xd5bde490
NvVideoDecoderDecode failed
...

So, is it possible to use the decoder with more than 64 different surfaces? (I am not talking about the maximum number of capture buffers.)

Regards,
Thomas
decode_error.tar.gz (7.3 KB)

Hi,
This looks to be a rare use case, having 128 NvBufSurfaces on the capture plane. We will check if we can support it in the future, but on the current release, please do not allocate that many buffers, to avoid this issue.

Hi DaneLLL

Just to clarify: I don't need to increase the maximum number of capture buffers (64). Since DMABUF capture buffers take the file descriptor of a surface, I was just trying to figure out why using a larger pool of surfaces (for caching decoded images) fails.

I can work around the issue by using two different pools: a big one to store at least 2 fully decoded GOPs, and a smaller one that complies with the video decoder's requirements. Then, when a frame has been decoded into the small pool, I copy/convert it from the small pool to the big pool.

This of course introduces an additional copy where I was trying to be zero-copy everywhere, but it seems to be the only solution for now, as my alternative method with one surface and batchSize > 1 isn't supported either.

So for me the topic is closed, as the issue has been identified and a workaround is possible.

Regards
Thomas