nvJPEG's JPEG encode performance on Ampere A100

Hi All,

I would like to understand whether the JPEG encoding performance given here uses any underlying hardware acceleration. My assumption is that there is no NVENC on the A100; is that right? Is this performance achieved by CUDA acceleration? The performance data for JPEG encode given in the link above does not seem to report any CPU utilization. In this context I would like to understand whether nvJPEG needs any assistance from the CPU for JPEG encoding.

Thanks

Hi Moderator,
Could you please help with this?
Thanks.

There is a new HW JPEG decode engine in A100. There is no corresponding JPEG encode engine. NVENC is for motion video. The nvJPEG library for both encode and decode uses a mix of GPU and CPU resources.
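
To illustrate the decode side: the hardware engine is exposed through nvJPEG's backend selection when the library handle is created. A minimal sketch, assuming the CUDA 11.x nvjpeg.h; on GPUs without the engine the hardware-backend create call fails, and you fall back to the default GPU + CPU backend:

#include <nvjpeg.h>

// Sketch: request the A100 hardware JPEG decode engine; fall back to
// the default (GPU + CPU hybrid) backend if no engine is available.
nvjpegHandle_t make_decode_handle() {
    nvjpegHandle_t handle = NULL;
    nvjpegStatus_t status =
        nvjpegCreateEx(NVJPEG_BACKEND_HARDWARE, NULL, NULL, 0, &handle);
    if (status != NVJPEG_STATUS_SUCCESS) {
        // No hardware engine on this GPU: use the default backend.
        nvjpegCreateEx(NVJPEG_BACKEND_DEFAULT, NULL, NULL, 0, &handle);
    }
    return handle;
}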

  1. I am assuming that there is no NVENC in the A100, so we cannot do HW-based JPEG encoding or motion video encoding; is that right?
  2. The JPEG encoder performance given in Fig 7a of the aforesaid link shows that the A100 can do more than 1000 fps of 4K-frame JPEG encode. I am assuming that this perf consumes 100% of the GPU. Is there any way to know how much CPU (Skylake, Turbo) was used?

Thanks

Correct. As indicated here, under NVENC, click on “Supported Format Details” and note that the A100 line is all “No”.

There is no dedicated hardware motion video encode engine on A100, and there is no dedicated hardware JPEG encode engine on A100.

You can still do JPEG encode on A100, but there is no hardware engine specifically for it. As indicated in Figure 2 there:

JPEG encoding process employing a parallel utilization of GPU CUDA software and CPU.
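
For illustration, that CUDA-software encode path looks roughly like this through the nvJPEG encoder API (a minimal sketch; the quality and subsampling settings are arbitrary example values, and error checking is omitted):

#include <nvjpeg.h>
#include <cuda_runtime.h>
#include <vector>

// Sketch: encode an interleaved BGR image already in device memory to JPEG.
void encode_example(unsigned char *d_bgr, int width, int height,
                    std::vector<unsigned char> &jpeg) {
    nvjpegHandle_t handle;
    nvjpegEncoderState_t state;
    nvjpegEncoderParams_t params;
    cudaStream_t stream = 0;

    nvjpegCreateSimple(&handle);                       // default (CUDA) backend
    nvjpegEncoderStateCreate(handle, &state, stream);
    nvjpegEncoderParamsCreate(handle, &params, stream);
    nvjpegEncoderParamsSetQuality(params, 90, stream);  // example quality value
    nvjpegEncoderParamsSetSamplingFactors(params, NVJPEG_CSS_420, stream);

    nvjpegImage_t src = {};
    src.channel[0] = d_bgr;                  // interleaved BGR, device memory
    src.pitch[0] = (size_t)width * 3;
    nvjpegEncodeImage(handle, state, params, &src,
                      NVJPEG_INPUT_BGRI, width, height, stream);

    // Query the bitstream size, then copy the compressed JPEG to the host.
    size_t length = 0;
    nvjpegEncodeRetrieveBitstream(handle, state, NULL, &length, stream);
    jpeg.resize(length);
    nvjpegEncodeRetrieveBitstream(handle, state, jpeg.data(), &length, stream);
    cudaStreamSynchronize(stream);

    nvjpegEncoderParamsDestroy(params);
    nvjpegEncoderStateDestroy(state);
    nvjpegDestroy(handle);
}

Note there is no hardware backend to select for encode; nvjpegCreateSimple gives you the default GPU + CPU path.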

I don’t have that information.

It obviously does not consume all of the GPU. I mean, it is just 5 cores:

The A100 includes a 5-core hardware JPEG decode engine. nvJPEG takes advantage of the hardware backend for batched processing of JPEG images.

The question here is about JPEG encode. There is no hardware JPEG encode engine on the A100.

Only CUDA SW implementation, yes.

I am looking to profile the performance of this A100 JPEG engine. A few questions in this context:

  1. Could the code given in this blog be used as a reference?
  2. Would it be possible for me to build JPEG decode code (meant to run on x86 + A100) on another x86 machine which does NOT have an A100 in it? I would like to build the code on the non-A100 machine but run it on the A100 machine. Would that be possible?

Thanks.

Hi All,

I am building samples from CUDALibrarySamples. While compiling the nvJPEG sample I am running into an error:

[ 33%] Building CUDA object CMakeFiles/nvjpegDecoder.dir/nvjpegDecoder.cpp.o
/usr/local/cuda/bin/nvcc   -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler=-fPIE   -std=c++11 -x cu -c /home/work/internet/nvidia/cudalibsamples/CUDALibrarySamples/nvJPEG/nvJPEG-Decoder/nvjpegDecoder.cpp -o CMakeFiles/nvjpegDecoder.dir/nvjpegDecoder.cpp.o
/home/work/internet/nvidia/cudalibsamples/CUDALibrarySamples/nvJPEG/nvJPEG-Decoder/nvjpegDecoder.cpp(56): error: identifier "nvjpegJpegStreamParseHeader" is undefined

This appears to be a compiler error rather than a linker error. I am wondering where the prototype for nvjpegJpegStreamParseHeader is defined. I checked nvjpeg.h and did not find it there.

~/work/internet/nvidia/cudalibsamples/CUDALibrarySamples/nvJPEG/nvJPEG-Decoder/build$ cat /usr/local/cuda-10.2/targets/x86_64-linux/include/nvjpeg.h | grep nvjpegDecodeBatchedSupported

Thanks

You’re using CUDA 10.2

But CUDA 11.0 or newer is required.

(and your grep command is looking for nvjpegDecodeBatchedSupported, which is something else)

Thanks Robert. I got carried away by what was reported by nvidia-smi (which shows CUDA 11).

nvidia-smi
Mon May 16 09:21:50 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.172.01   Driver Version: 450.172.01   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100 80GB PCIe      On   | 00000000:61:00.0 Off |                    0 |
| N/A   74C    P0    90W / 300W |    415MiB / 81252MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     19581      C   .../pcoip-agent/pcoip-server      413MiB |
+-----------------------------------------------------------------------------+

The result is the same when I look for nvjpegJpegStreamParseHeader:

grep nvjpegJpegStreamParseHeader /usr/local/cuda-10.2/targets/x86_64-linux/include/nvjpeg.h

Could there be a discrepancy in my CUDA installation? Do I need to install CUDA again to see the 11.0 headers?

Thanks

That directory suggests to me you are using CUDA 10.2. For an understanding of the difference between that and what is reported by nvidia-smi, see here.

Yes, there could be a discrepancy in your CUDA installation. I don’t know what you have installed on your machine. If you don’t know either, you may wish to reinstall (or install) CUDA 11.x. Follow the install instructions carefully, including the setup of environment variables (PATH needs to pick up the 11.x bin directory and LD_LIBRARY_PATH the 11.x lib64 directory; otherwise builds may silently keep using the 10.2 toolkit).

Thanks Robert. Would you suggest getting rid of the 10.2 I have on my system before I follow the instructions in the doc you linked for installing 11.x?

Are there specific instructions to uninstall? I see that there is no ‘uninstall’ script in /usr/local/cuda/bin

ll /usr/local/cuda/bin/*cuda*
-rwxr-xr-x 1 root root 4833904 Nov 13  2019 /usr/local/cuda/bin/cudafe++*
-rwxr-xr-x 1 root root 8890248 Nov 13  2019 /usr/local/cuda/bin/cuda-gdb*
-rwxr-xr-x 1 root root  581744 Nov 13  2019 /usr/local/cuda/bin/cuda-gdbserver*
-rwxr-xr-x 1 root root     800 Nov 13  2019 /usr/local/cuda/bin/cuda-install-samples-10.2.sh*
-rwxr-xr-x 1 root root  397480 Nov 13  2019 /usr/local/cuda/bin/cuda-memcheck*

Thanks.

I don’t have any suggestions about CUDA 10.2. If it were me I would not bother uninstalling it.

The Linux install guide I linked discusses uninstall methods. For example, here.

If you don’t know how you installed CUDA, I don’t know either. The lack of an uninstaller script would typically mean to me that you used a package manager method to install. If you read the document, you’ll get a better idea of what I mean.

Thanks Robert. After installing 11.7 I was able to run the nvjpegDecoder example. I would like to ask a question about its performance, which does not look quite right to me (the average decode time per image seems too large).

./nvjpegDecoder -i ../input_images/ -o output/
Decoding images in directory: ../input_images/, total 12, batchsize 1
Processing: ../input_images/img6.jpg
Image is 3 channels.
Channel #0 size: 640 x 480
Channel #1 size: 320 x 240
Channel #2 size: 320 x 240
YUV 4:2:0 chroma subsampling
Done writing decoded image to file: output//img6.bmp
Processing: ../input_images/img4.jpg
Image is 3 channels.
Channel #0 size: 640 x 426
Channel #1 size: 320 x 213
Channel #2 size: 320 x 213
YUV 4:2:0 chroma subsampling
Done writing decoded image to file: output//img4.bmp
Processing: ../input_images/cat_grayscale.jpg
Image is 1 channels.
Channel #0 size: 64 x 64
Grayscale JPEG 
Done writing decoded image to file: output//cat_grayscale.bmp
Processing: ../input_images/img5.jpg
Image is 3 channels.
Channel #0 size: 640 x 480
Channel #1 size: 320 x 240
Channel #2 size: 320 x 240
YUV 4:2:0 chroma subsampling
Done writing decoded image to file: output//img5.bmp
Processing: ../input_images/cat_baseline.jpg
Image is 3 channels.
Channel #0 size: 64 x 64
Channel #1 size: 64 x 64
Channel #2 size: 64 x 64
YUV 4:4:4 chroma subsampling
Done writing decoded image to file: output//cat_baseline.bmp
Processing: ../input_images/img7.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Done writing decoded image to file: output//img7.bmp
Processing: ../input_images/cat.jpg
Image is 3 channels.
Channel #0 size: 64 x 64
Channel #1 size: 64 x 64
Channel #2 size: 64 x 64
YUV 4:4:4 chroma subsampling
Done writing decoded image to file: output//cat.bmp
Processing: ../input_images/img2.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Done writing decoded image to file: output//img2.bmp
Processing: ../input_images/img9.jpg
Image is 3 channels.
Channel #0 size: 640 x 480
Channel #1 size: 320 x 240
Channel #2 size: 320 x 240
YUV 4:2:0 chroma subsampling
Done writing decoded image to file: output//img9.bmp
Processing: ../input_images/img1.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Done writing decoded image to file: output//img1.bmp
Processing: ../input_images/img3.jpg
Image is 3 channels.
Channel #0 size: 640 x 426
Channel #1 size: 320 x 213
Channel #2 size: 320 x 213
YUV 4:2:0 chroma subsampling
Done writing decoded image to file: output//img3.bmp
Processing: ../input_images/img8.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Done writing decoded image to file: output//img8.bmp
Total decoding time: 7.9063
Avg decoding time per image: 0.658859
Avg images per sec: 1.51778
Avg decoding time per batch: 0.658859

Could it be because I am not using the right driver? nvidia-smi says 515.43.04:

nvidia-smi
Mon May 16 16:36:01 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.43.04    Driver Version: 515.43.04    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  On   | 00000000:61:00.0 Off |                    0 |
| N/A   74C    P0    90W / 300W |    413MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      4152      C   .../pcoip-agent/pcoip-server      411MiB |
+-----------------------------------------------------------------------------+

Thanks.

Just to be clear, I am trying to compare the JPEG decode performance with what has been published here.

Thanks.

The information published there, using the test code you are running, is as follows:

Avg decoding time per image: 1.23571
Avg images per sec: 0.809248
Avg decoding time per batch: 1.23571

The data you have provided from your run is:

Avg decoding time per image: 0.658859
Avg images per sec: 1.51778
Avg decoding time per batch: 0.658859

Your data looks better to me.

I imagine, to get maximum throughput, you would need to batch images. The published batch size there for the performance graphs is 128.
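
If you go that route, the batched entry points in nvJPEG look roughly like this (a minimal sketch; allocation of the output image buffers and all error handling are omitted):

#include <nvjpeg.h>
#include <cuda_runtime.h>

// Sketch: decode `batch` JPEG bitstreams in one batched call.
// data[i]/lengths[i] are host pointers to the compressed bitstreams;
// out[i] are nvjpegImage_t destinations preallocated in device memory.
void decode_batch(nvjpegHandle_t handle,
                  const unsigned char *const *data, const size_t *lengths,
                  nvjpegImage_t *out, int batch, cudaStream_t stream) {
    nvjpegJpegState_t state;
    nvjpegJpegStateCreate(handle, &state);

    // One-time setup per batch size / output format; the fourth argument
    // is the maximum number of CPU threads nvJPEG may use.
    nvjpegDecodeBatchedInitialize(handle, state, batch, 1, NVJPEG_OUTPUT_RGBI);

    // Decode the whole batch on the given stream.
    nvjpegDecodeBatched(handle, state, data, lengths, out, stream);
    cudaStreamSynchronize(stream);

    nvjpegJpegStateDestroy(state);
}

Batching lets nvJPEG overlap the CPU-side bitstream work with the GPU (or hardware engine) decode across images, which is presumably how the batch-128 numbers in the blog are achieved.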

Hi Robert,
We are very impressed with, and count on, the JPEG decoder hardware acceleration in the A100, as quoted in the blog (> 3000 fps for 1920x1080 at a batch size of 128). I am looking to reproduce this performance.

I ran decoding of 1920x1080 JPEG images with a batch size of 128, but do not get anywhere close to 3000 fps.

~/work/internet/nvidia/cudalibsamples/CUDALibrarySamples/nvJPEG/nvJPEG-Decoder/build$ ./nvjpegDecoder -i ../../nvJPEG-Decoder/demo_video/frames/ -b 128 >& junk-no-output
~/work/internet/nvidia/cudalibsamples/CUDALibrarySamples/nvJPEG/nvJPEG-Decoder/build$ ./nvjpegDecoder -i ../../nvJPEG-Decoder/demo_video/frames/ -b 128  -o output/ >& junk-output
~/work/internet/nvidia/cudalibsamples/CUDALibrarySamples/nvJPEG/nvJPEG-Decoder/build$ tail junk-no-output -n 5
YUV 4:2:0 chroma subsampling
Total decoding time: 175.67
Avg decoding time per image: 0.323518
Avg images per sec: 3.09102
Avg decoding time per batch: 35.1341
~/work/internet/nvidia/cudalibsamples/CUDALibrarySamples/nvJPEG/nvJPEG-Decoder/build$ tail junk-output -n 5
Done writing decoded image to file: output//output_00173.bmp
Total decoding time: 176.713
Avg decoding time per image: 0.325438
Avg images per sec: 3.07278
Avg decoding time per batch: 35.3425

Please help me get to the performance stated in the blog.

Thanks

Those are actually numbers in milliseconds, not seconds: 0.323518 ms per image is 1 / 0.000323518 s ≈ 3091 images per second (the “Avg images per sec” line is really images per millisecond). So you are getting about 3091 images/s, right in line with the blog.