Jetson TX2 decoder frame delay?

Hello,

We are using the V4L2 decoder (/dev/nvhost-nvdec) directly (no GStreamer) on the TX2 (JetPack 3.3), and we are seeing a significant delay before decoded frames come back from it. It seems we need to submit quite a few encoded frames to the decoder (via output buffers) before any decoded frames are returned (via capture buffers). Our goal is to take camera input at 30 fps, encode the frames, send them over the wire, and decode them, all with as little latency as possible.

We can demonstrate the issue using the examples. First, let’s start with a known set of YUV420 frames (we set ctx->out_pixfmt = 0 in the 00_video_decode example to get YUV420 output instead of NV12):

./video_decode H264 -o ../../data/Video/sample_outdoor_car_1080p_10fps.yuv420 ../../data/Video/sample_outdoor_car_1080p_10fps.h264

This gives us a plain YUV420 file to start from. Now let’s encode these YUV frames (using the defaults in the encoding example) and print the number of bytes written to the file for each frame, so we can see the byte offset at which each frame starts in the resulting .h264 file:

./video_encode ../../data/Video/sample_outdoor_car_1080p_10fps.yuv420 1920 1080 H264 ~/test_car_encode.h264

FRAME 0 written = 23 bytes (23 total)
FRAME 1 written = 138010 bytes (138033 total)
FRAME 2 written = 3808 bytes (141841 total)
FRAME 3 written = 6725 bytes (148566 total)
FRAME 4 written = 7788 bytes (156354 total)
FRAME 5 written = 9013 bytes (165367 total)
FRAME 6 written = 10614 bytes (175981 total)
FRAME 7 written = 10172 bytes (186153 total)
FRAME 8 written = 10840 bytes (196993 total)
FRAME 9 written = 12142 bytes (209135 total)
FRAME 10 written = 14838 bytes (223973 total)
FRAME 11 written = 17351 bytes (241324 total)
FRAME 12 written = 13594 bytes (254918 total)
FRAME 13 written = 15241 bytes (270159 total)
FRAME 14 written = 12704 bytes (282863 total)
FRAME 15 written = 16811 bytes (299674 total)
FRAME 16 written = 13875 bytes (313549 total)
FRAME 17 written = 14583 bytes (328132 total)
FRAME 18 written = 10198 bytes (338330 total)
FRAME 19 written = 17245 bytes (355575 total)
FRAME 20 written = 20800 bytes (376375 total)
FRAME 21 written = 19596 bytes (395971 total)
FRAME 22 written = 15527 bytes (411498 total)
FRAME 23 written = 15072 bytes (426570 total)
FRAME 24 written = 16139 bytes (442709 total)
FRAME 25 written = 16582 bytes (459291 total)
FRAME 26 written = 19419 bytes (478710 total)
FRAME 27 written = 19089 bytes (497799 total)
FRAME 28 written = 11095 bytes (508894 total)
FRAME 29 written = 16387 bytes (525281 total)
FRAME 30 written = 15211 bytes (540492 total)
FRAME 31 written = 97565 bytes (638057 total)
...
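
As an independent cross-check of these per-frame offsets, the encoded file itself can be scanned for NAL unit boundaries. The following is only a minimal sketch and not part of the samples: it assumes the .h264 file is an Annex B byte stream (NAL units separated by 00 00 01 or 00 00 00 01 start codes), and `find_nal_units` is a hypothetical helper name. It is demonstrated here on a small synthetic buffer rather than on the real file.

```python
def find_nal_units(data):
    """Return a list of (offset, nal_unit_type) for each start code in `data`.

    Assumes an H.264 Annex B byte stream (an assumption about the file format).
    """
    units = []
    i = 0
    while True:
        i = data.find(b"\x00\x00\x01", i)
        if i < 0 or i + 3 >= len(data):
            break
        # A 4-byte start code is a 3-byte one preceded by an extra zero byte.
        start = i - 1 if i > 0 and data[i - 1] == 0 else i
        # In H.264 the NAL unit type is the low 5 bits of the first header byte.
        nal_type = data[i + 3] & 0x1F
        units.append((start, nal_type))
        i += 3
    return units


# Synthetic stream: SPS (type 7), PPS (type 8), then an IDR slice (type 5).
sample = (
    b"\x00\x00\x00\x01\x67AAAA"
    + b"\x00\x00\x00\x01\x68BB"
    + b"\x00\x00\x01\x65CCCC"
)
for offset, nal_type in find_nal_units(sample):
    print(f"NAL unit at byte {offset}, type {nal_type}")
```

To apply this to the real output, the bytes of ~/test_car_encode.h264 would be read with `open(path, "rb").read()` and passed in place of `sample`. A frame can consist of more than one NAL unit, so the offsets found this way will not necessarily match the FRAME lines one to one.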

Now let’s decode the file.

./video_decode H264 -o ~/test_car_decode.yuv420 ~/test_car_encode.h264

Watching closely how many bytes (chunks) we need to read and submit, using a chunk size of 10,000 bytes, we see:

Process video_decode created; pid = 25731
Listening on port 333
Remote debugging from host 192.168.229.30
Set governor to performance before enabling profiler
Failed to query video capabilities: Inappropriate ioctl for device
NvMMLiteOpen : Block : BlockType = 261
TVMR: NvMMLiteTVMRDecBlockOpen: 7647: NvMMLiteBlockOpen
NvMMLiteBlockCreate : Block : BlockType = 261
output buffer 0 = 10000 bytes (10000 bytes total)
Starting decoder capture loop thread
output buffer 1 = 10000 bytes (20000 bytes total)
output buffer 2 = 10000 bytes (30000 bytes total)
output buffer 3 = 10000 bytes (40000 bytes total)
output buffer 4 = 10000 bytes (50000 bytes total)
output buffer 5 = 10000 bytes (60000 bytes total)
output buffer 6 = 10000 bytes (70000 bytes total)
output buffer 7 = 10000 bytes (80000 bytes total)
output buffer 8 = 10000 bytes (90000 bytes total)
output buffer 9 = 10000 bytes (100000 bytes total)
output buffer 10 = 10000 bytes (110000 bytes total)
output buffer 11 = 10000 bytes (120000 bytes total)
output buffer 12 = 10000 bytes (130000 bytes total)
TVMR: cbBeginSequence: 1179: BeginSequence  1920x1088, bVPR = 0
TVMR: LowCorner Frequency = 0
TVMR: cbBeginSequence: 1529: DecodeBuffers = 17, pnvsi->eCodec = 4, codec = 0
TVMR: cbBeginSequence: 1600: Display Resolution : (1920x1080)
TVMR: cbBeginSequence: 1601: Display Aspect Ratio : (1920x1080)
TVMR: cbBeginSequence: 1669: ColorFormat : 5
TVMR: cbBeginSequence:1683 ColorSpace = NvColorSpace_YCbCr601
TVMR: cbBeginSequence: 1809: SurfaceLayout = 3
TVMR: cbBeginSequence: 1902: NumOfSurfaces = 24, InteraceStream = 0, InterlaceEnabled = 0, bSecure = 0, MVC = 0 Semiplanar = 1, bReinit = 1, BitDepthForSurface = 8 LumaBitDepth = 8, ChromaBitDepth = 8, ChromaFormat = 5
TVMR: cbBeginSequence: 1904: BeginSequence  ColorPrimaries = 2, TransferCharacteristics = 2, MatrixCoefficients = 2
Video Resolution: 1920x1080
[INFO] (NvEglRenderer.cpp:109) <renderer0> Setting Screen width 1920 height 1080
output buffer 13 = 10000 bytes (140000 bytes total)
Query and set capture successful
output buffer 14 = 10000 bytes (150000 bytes total)
output buffer 15 = 10000 bytes (160000 bytes total)
output buffer 16 = 10000 bytes (170000 bytes total)
output buffer 17 = 10000 bytes (180000 bytes total)
output buffer 18 = 10000 bytes (190000 bytes total)
output buffer 19 = 10000 bytes (200000 bytes total)
output buffer 20 = 10000 bytes (210000 bytes total)
output buffer 21 = 10000 bytes (220000 bytes total)
output buffer 22 = 10000 bytes (230000 bytes total)
output buffer 23 = 10000 bytes (240000 bytes total)
output buffer 24 = 10000 bytes (250000 bytes total)
output buffer 25 = 10000 bytes (260000 bytes total)
output buffer 26 = 10000 bytes (270000 bytes total)
output buffer 27 = 10000 bytes (280000 bytes total)
output buffer 28 = 10000 bytes (290000 bytes total)
output buffer 29 = 10000 bytes (300000 bytes total)
output buffer 30 = 10000 bytes (310000 bytes total)

<IT IS AT THIS POINT WHERE WE SEE THE FIRST FRAME CAPTURED FROM THE CAPTURE THREAD>

For testing, we sleep for 1 second after submitting each output buffer (one chunk of encoded data) so we can accurately determine how many bytes must be submitted before we see a captured frame. The first capture buffer does not appear until byte 310000 has been submitted, which falls within frame #16 according to this line from the encoder output above:

FRAME 16 written = 13875 bytes (313549 total)

In other words, the decoder returns nothing until roughly 16 encoded frames have been queued.
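
The mapping from a submitted byte count to a frame number can be checked mechanically. Here is a minimal Python sketch that uses the cumulative totals copied from the encoder log above. The helper names `frame_containing` and `complete_frames` are made up for illustration, and each FRAME line in the log is treated as one frame.

```python
from bisect import bisect_left, bisect_right

# Cumulative byte totals copied from the encoder log (FRAME 0 .. FRAME 17).
cumulative_totals = [
    23, 138033, 141841, 148566, 156354, 165367, 175981, 186153, 196993,
    209135, 223973, 241324, 254918, 270159, 282863, 299674, 313549, 328132,
]


def frame_containing(byte_count, totals):
    """Index of the frame that holds the last of the first `byte_count` bytes."""
    # The first frame whose cumulative total reaches byte_count contains that byte.
    return bisect_left(totals, byte_count)


def complete_frames(byte_count, totals):
    """Number of frames fully contained in the first `byte_count` bytes."""
    return bisect_right(totals, byte_count)


submitted = 310000  # bytes queued when the first capture buffer appeared
print("last byte falls in frame", frame_containing(submitted, cumulative_totals))
print("complete frames queued:", complete_frames(submitted, cumulative_totals))
```

With 310000 bytes submitted, this reports that the last byte lies in frame 16 and that 16 log entries (FRAME 0 through FRAME 15) were queued in full. At the 30 fps target, 16 queued frames would correspond to roughly 533 ms of delay.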

How do we reduce this latency? I’d like to get it down to around 1-3 frames if possible.

Please refer to
https://devtalk.nvidia.com/default/topic/1047641/jetson-tx2/tx2-h264-decode-query/post/5316981/#5316981

Great, thanks.