HW accelerated JPEG encoding?

It looks like the following gstreamer jpeg encoding plugins are available: jpegenc, nvjpegenc, nv_omx_jpegenc

The multimedia user guide does not specifically describe what the differences are. Can anyone shed light on the differences between them?

In test code I wrote, the time to encode a 640x480 grayscale image is ~2 ms on the Tegra X1 regardless of whether jpegenc or nvjpegenc is used, and both max out the CPU while encoding, which makes me think that the encoding is NOT being hardware accelerated. I have not been able to get the nv_omx_jpegenc plugin to work.

jpegenc and nvjpegenc work with the following simple pipeline:

$ gst-launch-0.10 filesrc location=test_in.jpg ! jpegdec ! jpegenc ! filesink location=test_out.jpg -e

but nv_omx_jpegenc does not:

$ gst-launch-0.10 filesrc location=test_in.jpg ! jpegdec ! nv_omx_jpegenc ! filesink location=test_out.jpg -e
Inside NvxLiteH264DecoderLowLatencyInit NvxLiteH264DecoderLowLatencyInit set DPB and Mjstreaming
Inside NvxLiteH265DecoderLowLatencyInit NvxLiteH265DecoderLowLatencyInit set DPB and Mjstreaming
Setting pipeline to PAUSED ...
Pipeline is PREROLLING ...
ERROR: from element /GstPipeline:pipeline0/GstFileSrc:filesrc0: Internal data flow error.
Additional debug info:
gstbasesrc.c(2625): gst_base_src_loop (): /GstPipeline:pipeline0/GstFileSrc:filesrc0:
streaming task paused, reason not-negotiated (-4)
ERROR: pipeline doesn't want to preroll.
Setting pipeline to NULL ...
Freeing pipeline ...

Thanks for your post, the encoder team has been investigating your report and will respond soon.

Best regards,
Dusty

Hi,

Please find details of the available JPEG encode plugins:

  1. jpegenc : OSS jpeg encode plugin (sw encode)
  2. nvjpegenc : nvidia accelerated jpeg encode (hw encode)
  3. nv_omx_jpegenc : OSS gst-openmax jpeg encode plugin (for gst-0.10, not recommended)

Use the following pipeline for gst-0.10:

gst-launch-0.10 filesrc location=test_in.jpg ! nvjpegdec ! nvjpegenc ! filesink location=test_out.jpg -e

Also, we recommend using gstreamer-1.0 instead, with the following pipeline:

gst-launch-1.0 filesrc location=test_in.jpg ! nvjpegdec ! nvjpegenc ! filesink location=test_out.jpg -e

Thank you for explaining the differences. It appears my original understanding was correct.

Unfortunately, I am still seeing the strange performance difference between jpegenc and nvjpegenc… nvjpegenc takes twice as much CPU as jpegenc.

I’ve tested with both my own code, and using stock gstreamer pipelines.
In both tests I used tegrastats to monitor, and had run the max_perf script posted here:
https://devtalk.nvidia.com/default/topic/901337/post/4747186/#4747186

Using jpegenc

gst-launch-0.10 videotestsrc is-live=true ! video/x-raw-rgb, framerate=30/1, width=640, height=480 ! jpegenc quality=90 ! fakesink
RAM 854/3854MB (lfb 2x4MB) SWAP 0/0MB (cached 0MB) cpu [5%,5%,3%,27%]@1912 EMC 3%@1600 AVP 53%@12 VDE 0 GR3D 0%@998 EDP limit 1912

Using nvjpegenc

gst-launch-0.10 videotestsrc is-live=true ! video/x-raw-rgb, framerate=30/1, width=640, height=480 ! nvjpegenc quality=90 ! fakesink
RAM 854/3854MB (lfb 2x4MB) SWAP 0/0MB (cached 0MB) cpu [7%,4%,54%,1%]@1912 EMC 2%@1600 AVP 32%@12 VDE 0 GR3D 0%@998 EDP limit 1912

If possible, we recommend using gstreamer-1.0. The higher CPU usage mentioned above is caused by the raw->NV format conversion.
If we pass NVMM buffers (NV format) to the encoder, we will not see the higher CPU load. The following experiment helps illustrate this.

I have generated an MJPEG file with 3000 buffers. This file is decoded using the OSS jpegdec, and the output of jpegdec is fed to jpegenc.

gst-launch-1.0 filesrc location=enc640x480_mjpeg_3000.mp4 ! qtdemux ! jpegdec ! jpegenc ! filesink location=test_out.jpg -e

Tegrastats results below:

RAM 448/3854MB (lfb 733x4MB) SWAP 0/0MB (cached 0MB) cpu [0%,100%,0%,0%]@1912 EMC 4%@1600 AVP 0%@80 VDE 0 GR3D 0%@76 EDP limit 1912
RAM 448/3854MB (lfb 733x4MB) SWAP 0/0MB (cached 0MB) cpu [0%,100%,0%,0%]@1912 EMC 4%@1600 AVP 0%@80 VDE 0 GR3D 0%@76 EDP limit 1912
RAM 448/3854MB (lfb 733x4MB) SWAP 0/0MB (cached 0MB) cpu [0%,100%,0%,0%]@1912 EMC 4%@1600 AVP 0%@80 VDE 0 GR3D 0%@76 EDP limit 1912
RAM 448/3854MB (lfb 733x4MB) SWAP 0/0MB (cached 0MB) cpu [3%,100%,0%,0%]@1912 EMC 4%@1600 AVP 0%@80 VDE 0 GR3D 0%@76 EDP limit 1912
RAM 448/3854MB (lfb 733x4MB) SWAP 0/0MB (cached 0MB) cpu [1%,100%,0%,0%]@1912 EMC 4%@1600 AVP 0%@80 VDE 0 GR3D 0%@76 EDP limit 1912
RAM 448/3854MB (lfb 733x4MB) SWAP 0/0MB (cached 0MB) cpu [0%,100%,0%,0%]@1912 EMC 4%@1600 AVP 0%@80 VDE 0 GR3D 0%@76 EDP limit 1912
RAM 448/3854MB (lfb 733x4MB) SWAP 0/0MB (cached 0MB) cpu [1%,100%,0%,0%]@1912 EMC 4%@1600 AVP 0%@80 VDE 0 GR3D 0%@76 EDP limit 1912
RAM 449/3854MB (lfb 733x4MB) SWAP 0/0MB (cached 0MB) cpu [0%,100%,0%,1%]@1912 EMC 4%@1600 AVP 0%@80 VDE 0 GR3D 0%@76 EDP limit 1912
RAM 449/3854MB (lfb 733x4MB) SWAP 0/0MB (cached 0MB) cpu [0%,100%,0%,0%]@1912 EMC 4%@1600 AVP 0%@80 VDE 0 GR3D 0%@76 EDP limit 1912

Now the same file is decoded using nvjpegdec. Its output is an NVMM buffer, which is fed to nvjpegenc:
gst-launch-1.0 filesrc location=enc640x480_mjpeg_3000.mp4 ! qtdemux ! nvjpegdec ! nvjpegenc ! filesink location=test_out.jpg -v -e

Tegrastats results below:

RAM 447/3854MB (lfb 732x4MB) SWAP 0/0MB (cached 0MB) cpu [3%,54%,0%,0%]@825 EMC 14%@665 AVP 1%@80 VDE 0 GR3D 0%@76 EDP limit 1912
RAM 446/3854MB (lfb 732x4MB) SWAP 0/0MB (cached 0MB) cpu [7%,3%,54%,0%]@921 EMC 14%@665 AVP 0%@80 VDE 0 GR3D 0%@76 EDP limit 1912
RAM 447/3854MB (lfb 732x4MB) SWAP 0/0MB (cached 0MB) cpu [4%,54%,0%,0%]@825 EMC 14%@665 AVP 0%@80 VDE 0 GR3D 0%@76 EDP limit 1912
RAM 448/3854MB (lfb 732x4MB) SWAP 0/0MB (cached 0MB) cpu [2%,56%,0%,0%]@825 EMC 14%@665 AVP 0%@80 VDE 0 GR3D 0%@76 EDP limit 1912
RAM 448/3854MB (lfb 732x4MB) SWAP 0/0MB (cached 0MB) cpu [4%,32%,23%,1%]@825 EMC 14%@665 AVP 0%@80 VDE 0 GR3D 0%@76 EDP limit 1912
RAM 446/3854MB (lfb 731x4MB) SWAP 0/0MB (cached 0MB) cpu [14%,22%,0%,23%]@204 EMC 14%@665 AVP 0%@115 VDE 0 GR3D 0%@76 EDP limit 1912

As can be seen, if NVMM buffers are provided to nvjpegenc, CPU utilization does not increase. In summary, the higher CPU load is caused by the raw->NV conversion. This is the expected result.
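For a rough sense of why the conversion shows up as a loaded core, here is some back-of-the-envelope arithmetic (my own numbers, not measurements from this thread); the frame size and formats match the 640x480 pipelines above:

```python
# Rough arithmetic sketch: how much raw pixel data the CPU must touch per
# second when converting frames before nvjpegenc can consume them.
def raw_rate_mb_per_s(width, height, fps, bytes_per_pixel):
    """Raw video data rate in megabytes per second."""
    return width * height * bytes_per_pixel * fps / 1e6

# 640x480 RGB (3 bytes/pixel, as in the x-raw-rgb pipelines above) at 30 fps:
rgb_rate = raw_rate_mb_per_s(640, 480, 30, 3)    # ~27.6 MB/s read and converted
# An I420/NV-style 4:2:0 target uses 1.5 bytes/pixel:
nv_rate = raw_rate_mb_per_s(640, 480, 30, 1.5)   # ~13.8 MB/s written back
```

That per-frame read/convert/write loop running every 33 ms is consistent with one CPU core sitting at 50%+ in the tegrastats output above.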

Is there somewhere that I can find documentation on the NVMM buffers/formats?

The goal I’m working towards is to leverage the HW compression from a custom application.

appsrc ! nvjpegenc ! appsink

The gst-launch pipelines were just for simple benchmarking… and the performance results matched my custom code.
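For what it’s worth, here is a sketch of how I’d assemble the pipeline description for that appsrc case in gstreamer-1.0. This is a guess based on this thread, not a confirmed recipe; the I420 format and the helper name are my own assumptions. The point is the "(memory:NVMM)" caps feature on the appsrc caps, so nvjpegenc would receive NVMM buffers instead of system memory:

```python
# Hypothetical sketch, not a confirmed NVIDIA recipe: build a gst-launch-1.0
# style description for an appsrc -> nvjpegenc -> appsink pipeline. The
# "(memory:NVMM)" caps feature marks the buffers as NVMM; without it the
# encoder appears to fall back to the slower S/W path discussed above.
def appsrc_pipeline(width=640, height=480, fps=30, quality=90, nvmm=True):
    feature = "(memory:NVMM)" if nvmm else ""
    caps = (f"video/x-raw{feature}, format=I420, "
            f"width={width}, height={height}, framerate={fps}/1")
    return f'appsrc caps="{caps}" ! nvjpegenc quality={quality} ! appsink'
```

Whether appsrc can actually hand over NVMM-backed buffers this way is exactly what I’m trying to find out; the string above just makes the caps requirement explicit.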

If you download the gstomx-1.0 source from here: http://developer.download.nvidia.com/embedded/L4T/r23_Release_v1.0/source/gstomx1_src.tbz2
Search it for terms like ‘nvbuffer’, ‘nv_buffer’, ‘nvmm’, and ‘CUDA’. I will see if I can find a doc.

Dusty, any luck finding documentation on the NVMM buffers/formats?

Not yet, although we do have a request in with engineering for more info or docs about using NVMM; we will follow up this week.

I am also very interested in this documentation, especially in the difference between video/x-raw and video/x-raw(memory:NVMM).

Dusty, any luck finding documentation on the NVMM buffers/formats?

I saw that the nvidia developer site has sources for gstjpeg


http://developer.download.nvidia.com/embedded/L4T/r23_Release_v1.0/source/gstjpeg_src.tbz2

Is this the source for the nvjpeg plugin? It looks like it is.

Browsing through the source, it appears that if the input data is not in an NVMM buffer then it does a normal S/W encode rather than converting to NVMM, which might explain the behavior I’ve been seeing. Is this correct?

Hi,

I am not sure about NVMM, but HW encoders/decoders normally require that the memory be aligned to some boundary and physically contiguous, and for that reason you cannot get it from an ordinary kernel allocator (unless it comes from the framebuffer). I need to read the source code of the plugin, but my guess is that it is allocating the NVMM buffers from a dedicated heap or using its own memory allocator. dusty_nv, is this the case?

When a buffer that is not NVMM is received by the element, the system likely needs to do a memory copy into an NVMM buffer, causing overhead; it is even worse if you test with videotestsrc, which also has to generate the pattern. In gstreamer 1.0 you can provide a memory allocator for the plugin: there is a property on the elements called peer-alloc that you can set to true so the downstream element can provide this NVMM memory to store the data, but in that case nvjpegenc must support the functions that gstreamer will call to request that memory.

Short answer, in the following pipeline:

appsrc ! nvjpegenc ! appsink

You need to be sure that the memory pushed in by the appsrc is NVMM memory.

-David