Jetson Nano HW encoder/decoder conflicting documentation

Hi,

I’m seeking clarification on HW-accelerated decoding and encoding of 10-bit and 12-bit HEVC content on the Jetson Nano.

Specifically, the issue is that the documentation provides contradictory information. Per the Developer Guide here, the Nano is listed as supporting only 8-bit HEVC formats.

However, per the NVIDIA GStreamer documentation, HW-accelerated decode and encode are supported all the way up to 12-bit. It even mentions special flags to increase performance on low-memory devices, such as the Jetson Nano series.

As such, I have a few questions:
1. Which of these documents should I consider to be correct?
2. If the first document is correct, how is GStreamer supporting higher bit depth content than the native decode/encode blocks?
3. If the second document is correct, how would I access this capability outside of GStreamer?
4. If the second document is correct, what are the performance penalties for using higher bit depth content?
5. If both documents are correct, what is the expected output? A file that has the “extra” bit depth truncated?

Thanks,

FCLC

Hi,
Please check this:
https://docs.nvidia.com/jetson/l4t/index.html#page/Tegra%20Linux%20Driver%20Package%20Development%20Guide/software_features_jetson_nano.html#wwpID0EZHA
The additional formats are supported on TX2 or Xavier.

Hi @DaneLLL !

The link you shared is the same as the one in the first post. The issue is that it directly contradicts another part of the NVIDIA documentation.

What confuses me is that the GStreamer documentation (second link) explicitly discusses having the Nano decode 10/12-bit content, and how to get extra performance out of it.

That’s why I have those 5 questions :sweat_smile:

Hi,
On Jetson Nano, we support YUV420 8-bit in encoding/decoding. YUV420 10-bit and 12-bit are not supported.

Gotcha! Might be worth updating the GStreamer documentation in that case!

Otherwise, is there a way you know of to simply truncate the last 2 bits of a 10-bit YUV input to 8-bit, so that the decoder can still be used?
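For clarity, this is roughly the kind of truncation I have in mind. A minimal CPU-side sketch, assuming the 10-bit frames are in a yuv420p10le-style layout (10 significant bits in the low bits of little-endian 16-bit words); truncate10to8 is just an illustrative name:

#include <cstddef>
#include <cstdint>

// Drop the 2 least-significant bits of every 10-bit sample to get plain 8-bit
// samples. 'count' is the total number of samples across all three planes.
void truncate10to8(const std::uint16_t* src, std::uint8_t* dst, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i) {
        dst[i] = static_cast<std::uint8_t>(src[i] >> 2); // keep the top 8 of the 10 bits
    }
}

(For P010-style layouts, where the 10 bits sit in the high bits of each 16-bit word, the shift would be >> 8 instead.)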

The way I read the second document is that the notes are not related to the 12-bit decoding format, but rather to how to use GStreamer in general on lower-memory devices.

The formatting is somewhat unfortunate, because all the paragraphs run into each other, and it seems as if the “note” is about the 12-bit format, but that’s probably not the case.

The documentation explains the “full range” of GStreamer (for all of the devices), but the Nano doesn’t actually do everything that GStreamer would support if the hardware could.


Yeah, that’s part of the reason I thought it might be the fifth option: decoding from 10/12-bit to 8-bit output by truncating the last 2/4 bits.

Specifically, right after the note on the Nano, it provides:

gst-launch-1.0 filesrc location=<filename.mp4> !
qtdemux ! h265parse ! omxh265dec enable-low-outbuffer=1 !
'video/x-raw(memory:NVMM), format=(string)NV12' ! fakesink sync=1 -e

Unless I’m misunderstanding the code example, regardless of the hardware the above is run on, the content will be output as NV12, which can only ever be 8-bit.

If it does/can decode higher bit depth content but can only output to yuv420 (instead of yuv420p10le), that’s totally fine, but it’s ambiguous at best right now.

Heck, if it does the conversion to 8-bit directly, that’s actually very convenient for my use case! (8K/4K HDR 10-bit to 2K 8-bit SDR)

Thanks for the reply!

In that case, is it possible to decode the frames using CUDA cores instead of the hardware blocks? I’m aware this would come with a performance penalty compared to the dedicated hardware block, but better to take a performance hit than to fail completely. It seems that the capability existed in prior versions of the CUDA SDK. Documentation here

However, I can only find references to using the dedicated hardware blocks for supported formats, and can’t seem to find anything about failing gracefully and falling back to CUDA cores/SMs for the decoding process.
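For reference, the closest thing I can find on the desktop NVDEC (nvcuvid) side is querying the decoder capabilities up front and handling the failure yourself. A minimal sketch, assuming the Video Codec SDK headers and the CUDA driver API are available (I haven’t verified that nvcuvid is even exposed on the Nano, which normally goes through the V4L2/Multimedia APIs instead):

#include <cuda.h>
#include <nvcuvid.h>
#include <cstdio>

// Build (desktop Linux): g++ caps_probe.cpp -o caps_probe -lcuda -lnvcuvid
int main()
{
    // Driver API boilerplate: a CUDA context must be current before querying caps.
    cuInit(0);
    CUdevice dev = 0;
    cuDeviceGet(&dev, 0);
    CUcontext ctx = nullptr;
    cuCtxCreate(&ctx, 0, dev);

    // Ask whether the hardware decoder supports HEVC 4:2:0 at 10-bit depth.
    CUVIDDECODECAPS caps = {};
    caps.eCodecType      = cudaVideoCodec_HEVC;
    caps.eChromaFormat   = cudaVideoChromaFormat_420;
    caps.nBitDepthMinus8 = 2; // 10-bit
    cuvidGetDecoderCaps(&caps);

    std::printf("HEVC 4:2:0 10-bit supported in hardware: %s\n",
                caps.bIsSupported ? "yes" : "no");

    cuCtxDestroy(ctx);
    return 0;
}

That only tells you the format isn’t supported, though; there is still nothing that falls back to the SMs for you.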

Do you have any insight on this functionality and how to leverage it?

As for using Xavier or TX2, unfortunately the webstore is still closed, so the TX2 doesn’t seem to be an option.

Hi,
@FCLC are you able to enable YUV420 10-bit decoder based on

By default we leverage the hardware encoding and decoding blocks on Jetson Nano, and YUV420 10-bit is not supported. A possible solution is to use the GPU, but there is no existing CUDA code for it. It may take some effort to study the decoding algorithm and do the implementation.

Unfortunately the tonemapping filter I designed for that post/thread needs to take decoded footage as input before it can be used.

It seems that in CUDA 8 the ability to use CUDA cores to handle the encoding and decoding of content was available, but it’s been deprecated in more recent versions. The components were nvcuvid and nvcuvenc, but they haven’t been updated, unfortunately.

From what I can tell, the language/SDK/API has no native mechanism in modern versions to fall back on CUDA in the case where a specific format is unsupported.

Now, in the case of discrete cards in a desktop/typical x86 machine, you’d fall back to the CPU, but on the Jetson, across different codecs/bitrates/resolutions, using all software vs. all hardware creates a literal order-of-magnitude difference in performance (and the Jetson was in 5 W mode as compared to MAXN + jetson_clocks, to try to give every advantage to software handling).

As of now I see 3 solutions:

1. Patch the JetPack driver/firmware to allow the Jetson to intake 10-bit into the decoder, with the knowledge that the hardware will truncate the last 2 bits of every channel. You would not be able to leverage the increased quality of the 10-bit input, but it would allow for some functionality. The API would probably throw a warning to the console saying that it is an imperfect decode, but it’s better than nothing. (I believe this would be the easiest solution, but it would only apply to the Nano.)

2. A feature request is accepted and delivered, reintroducing a more modern implementation of the CUDA-based decoder that is then available (but must be strictly enabled by the program) as a graceful fallback when a specific profile/codec/codec sub-specification is unsupported on the hardware in question. This would be useful on any and all CUDA-compliant devices, at the known cost of lower per-stream performance and/or higher power consumption for a given stream. From an ecosystem standpoint (and with the AV1 codec coming soon), this would allow for an expansion of capability across all CUDA devices and let developers support a larger number of devices without having to split workloads between host CPU and host GPU depending on the configuration.
The implementation would go from the current implementation of

// Check if content is supported
if (!decodecaps.bIsSupported) {
    NVDEC_THROW_ERROR("Codec not supported on this GPU", CUDA_ERROR_NOT_SUPPORTED);
}

To now being

// Check if content is supported in hardware
if (!decodecaps.bIsSupported) {
    // Check if the CUDA compatibility decoder is allowed AND enabled
    if (!decode_cuda_enabled) {
        NVDEC_THROW_ERROR("Codec not supported on this GPU", CUDA_ERROR_NOT_SUPPORTED);
    } else {
        // Something that remaps cuvidCreateDecoder() to a new function call that
        // implements a CUDA-based decoder as the decoder entry point,
        // perhaps cuvidCreateCudaDecoder()
        printf("CUDA compatibility decoder enabled, performance will be reduced!\n");
    }
}

Then the pipeline would continue as it normally would for the NVDEC API.

This is by far the most preferred scenario, as it enables greater compatibility and fewer headaches for any developers targeting a large diversity of user systems. (A rough application-side sketch of this fallback follows the list below.)

3. Somehow put together a custom CUDA-based encoder of my own, specifically targeted at the Nano, that is limited in feature set and very application-specific.
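To make option 2 a bit more concrete, here is a rough application-side sketch of the opt-in fallback, assuming the desktop Video Codec SDK (nvcuvid) headers; CreateDecoderWithFallback() is a name I made up, and cuvidCreateCudaDecoder() is purely hypothetical and does not exist in any current SDK:

#include <nvcuvid.h>
#include <cstdio>

// Use the NVDEC hardware block when the caps query says the format is supported,
// and only fall back to a (not yet existing) CUDA-core decoder if the application
// has explicitly opted in.
CUresult CreateDecoderWithFallback(CUvideodecoder*        phDecoder,
                                   CUVIDDECODECREATEINFO* pCreateInfo,
                                   bool                   hwSupported,       // e.g. from cuvidGetDecoderCaps()
                                   bool                   allowCudaFallback) // strictly opt-in
{
    if (hwSupported) {
        // Normal path: dedicated hardware decoder.
        return cuvidCreateDecoder(phDecoder, pCreateInfo);
    }
    if (!allowCudaFallback) {
        // Current behaviour: hard failure when the format isn't supported.
        return CUDA_ERROR_NOT_SUPPORTED;
    }
    std::printf("CUDA compatibility decoder enabled, performance will be reduced!\n");
    // Hypothetical entry point that would run the codec on CUDA cores/SMs:
    // return cuvidCreateCudaDecoder(phDecoder, pCreateInfo);
    return CUDA_ERROR_NOT_SUPPORTED; // placeholder until such an entry point exists
}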

That last option is my concern. Ideally, if I could get the source or a general outline of the CUDA 8 nvcuvid functions, I could implement them and update them for other codecs.

Update:

It seems I’m not the first to encounter this.

It seems that NVDEC used to be able to blindly accept HEVC Main 10 and decode it down to 8-bit via truncation of the last 2 bits. That thread’s OP is a little incoherent, but the third post does seem to have some chance of working: Will HEVC Main 10 Profile decoding work using CUVID? When? - #3 by philipl

I’m having trouble finding more than this pull request that changes 2 lines: cuvid: Pass bit depth information to decoder · FFmpeg/FFmpeg@289a6bb · GitHub

It might be possible to remove the cuvid “protections” that prevent outputting 8-bit from a 10-bit source. This will presumably create some image artifacts, but it’s better than nothing.

Thankfully, because of the way HDR content is stored (linear light with metadata to add contrast, saturation, and colour), these artifacts may not be relevant (since the dynamic range isn’t squeezed or stretched for gamma).

Further discussion of the Jetson Nano being able to decode 10-bit:
Smart-Radio-Video-Streaming-with-the-NVidia-Jetson-Nano-v0520.pdf (789.8 KB)

The documentation is conflicting, and I’ve not had the time/patience to deal with the Nano and its conflicting documentation of late, but I hope to return when I take my next sabbatical.