nvJPEG's JPEG encode performance on Ampere A100

Hi All,

I would like to understand whether the JPEG encoding performance given here uses any underlying hardware acceleration. My assumption is that there is no NVENC on the A100; is that right? Is this performance achieved by CUDA acceleration? The performance data for JPEG encode given in the link above does not seem to report any CPU utilization. In this context I would like to understand whether nvJPEG needs any assistance from the CPU for JPEG encoding.

Thanks

Hi Moderator,
Could you please help with this?
Thanks.

There is a new HW JPEG decode engine in A100. There is no corresponding JPEG encode engine. NVENC is for motion video. The nvJPEG library for both encode and decode uses a mix of GPU and CPU resources.
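
To illustrate the decode side: the hardware engine is exposed through nvJPEG's backend selection when the library handle is created. A minimal sketch, assuming the CUDA 11.x nvjpeg.h; on GPUs without the engine the hardware-backend create call fails, and you fall back to the default GPU + CPU backend:

#include <nvjpeg.h>

// Sketch: request the A100 hardware JPEG decode engine; fall back to
// the default (GPU + CPU hybrid) backend if no engine is available.
nvjpegHandle_t make_decode_handle() {
    nvjpegHandle_t handle = NULL;
    nvjpegStatus_t status =
        nvjpegCreateEx(NVJPEG_BACKEND_HARDWARE, NULL, NULL, 0, &handle);
    if (status != NVJPEG_STATUS_SUCCESS) {
        // No hardware engine on this GPU: use the default backend.
        nvjpegCreateEx(NVJPEG_BACKEND_DEFAULT, NULL, NULL, 0, &handle);
    }
    return handle;
}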

  1. I am assuming that there is no NVENC in the A100, so we cannot do HW-based JPEG encoding or motion video encoding; is that right?
  2. The JPEG encoder performance given in Fig 7a of the aforesaid link shows that the A100 can do more than 1000 fps of 4K-frame JPEG encode. I am assuming that this perf consumes 100% of the GPU. Is there any way to know how much CPU (Skylake, Turbo) was used?

Thanks

Correct. As indicated here, under NVENC, click on “Supported Format Details” and note that the A100 line is all “No”.

There is no dedicated hardware motion video encode engine on A100, and there is no dedicated hardware JPEG encode engine on A100.

You can still do JPEG encode on A100, but there is no hardware engine specifically for it. As indicated in Figure 2 there:

JPEG encoding process employing a parallel utilization of GPU CUDA software and CPU.
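
For illustration, that CUDA-software encode path looks roughly like this through the nvJPEG encoder API (a minimal sketch; the quality and subsampling settings are arbitrary example values, and error checking is omitted):

#include <nvjpeg.h>
#include <cuda_runtime.h>
#include <vector>

// Sketch: encode an interleaved BGR image already in device memory to JPEG.
void encode_example(unsigned char *d_bgr, int width, int height,
                    std::vector<unsigned char> &jpeg) {
    nvjpegHandle_t handle;
    nvjpegEncoderState_t state;
    nvjpegEncoderParams_t params;
    cudaStream_t stream = 0;

    nvjpegCreateSimple(&handle);                       // default (CUDA) backend
    nvjpegEncoderStateCreate(handle, &state, stream);
    nvjpegEncoderParamsCreate(handle, &params, stream);
    nvjpegEncoderParamsSetQuality(params, 90, stream);  // example quality value
    nvjpegEncoderParamsSetSamplingFactors(params, NVJPEG_CSS_420, stream);

    nvjpegImage_t src = {};
    src.channel[0] = d_bgr;                  // interleaved BGR, device memory
    src.pitch[0] = (size_t)width * 3;
    nvjpegEncodeImage(handle, state, params, &src,
                      NVJPEG_INPUT_BGRI, width, height, stream);

    // Query the bitstream size, then copy the compressed JPEG to the host.
    size_t length = 0;
    nvjpegEncodeRetrieveBitstream(handle, state, NULL, &length, stream);
    jpeg.resize(length);
    nvjpegEncodeRetrieveBitstream(handle, state, jpeg.data(), &length, stream);
    cudaStreamSynchronize(stream);

    nvjpegEncoderParamsDestroy(params);
    nvjpegEncoderStateDestroy(state);
    nvjpegDestroy(handle);
}

Note there is no hardware backend to select for encode; nvjpegCreateSimple gives you the default GPU + CPU path.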

I don’t have that information.

It obviously does not consume all of the GPU. I mean, it is just 5 cores:

The A100 includes a 5-core hardware JPEG decode engine. nvJPEG takes advantage of the hardware backend for batched processing of JPEG images.

The question here is about JPEG encode. There is no hardware JPEG encode engine on the A100.

Only CUDA SW implementation, yes.

I am looking to profile the performance of this A100 JPEG engine. A few questions in this context:

  1. Could the code given in this blog be used as a reference?
  2. Would it be possible for me to build JPEG decode code (meant to run on x86 + A100) on another x86 machine which does NOT have an A100 in it? I would like to build the code on the non-A100 machine but run it on the A100 machine. Would that be possible?

Thanks.

Hi All,

I am building samples from CUDALibrarySamples. While compiling the nvJPEG sample I am running into an error:

[ 33%] Building CUDA object CMakeFiles/nvjpegDecoder.dir/nvjpegDecoder.cpp.o
/usr/local/cuda/bin/nvcc   -I/usr/local/cuda/targets/x86_64-linux/include  -Xcompiler=-fPIE   -std=c++11 -x cu -c /home/work/internet/nvidia/cudalibsamples/CUDALibrarySamples/nvJPEG/nvJPEG-Decoder/nvjpegDecoder.cpp -o CMakeFiles/nvjpegDecoder.dir/nvjpegDecoder.cpp.o
/home/work/internet/nvidia/cudalibsamples/CUDALibrarySamples/nvJPEG/nvJPEG-Decoder/nvjpegDecoder.cpp(56): error: identifier "nvjpegJpegStreamParseHeader" is undefined

This appears to be a compiler error rather than a linker error. I am wondering where the prototype for nvjpegJpegStreamParseHeader is defined. I checked nvjpeg.h and did not find it there.

~/work/internet/nvidia/cudalibsamples/CUDALibrarySamples/nvJPEG/nvJPEG-Decoder/build$ cat /usr/local/cuda-10.2/targets/x86_64-linux/include/nvjpeg.h | grep nvjpegDecodeBatchedSupported

Thanks

You’re using CUDA 10.2

But CUDA 11.0 or newer is required.

(and your grep command is looking for nvjpegDecodeBatchedSupported, which is something else)

Thanks Robert. I got carried away by what was reported by nvidia-smi (which shows CUDA 11).

nvidia-smi
Mon May 16 09:21:50 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.172.01   Driver Version: 450.172.01   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100 80GB PCIe      On   | 00000000:61:00.0 Off |                    0 |
| N/A   74C    P0    90W / 300W |    415MiB / 81252MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     19581      C   .../pcoip-agent/pcoip-server      413MiB |
+-----------------------------------------------------------------------------+

The result is the same when I look for nvjpegJpegStreamParseHeader:

grep nvjpegJpegStreamParseHeader /usr/local/cuda-10.2/targets/x86_64-linux/include/nvjpeg.h

Could there be a discrepancy in my CUDA installation? Do I need to install CUDA again to see the 11.0 headers?

Thanks

That directory suggests to me you are using CUDA 10.2. For an understanding of the difference between that and what is reported by nvidia-smi, see here.

Yes, there could be a discrepancy in your CUDA installation. I don’t know what you have installed on your machine. If you don’t know either, you may wish to reinstall (or install) CUDA 11.x. Follow the install instructions carefully, including the setup of environment variables (PATH needs to pick up the 11.x bin directory and LD_LIBRARY_PATH the 11.x lib64 directory; otherwise builds may silently keep using the 10.2 toolkit).

Thanks Robert. Would you suggest getting rid of the 10.2 I have on my system before I follow the instructions in the doc you linked for installing 11.x?

Are there specific instructions to uninstall? I see that there is no ‘uninstall’ script in /usr/local/cuda/bin

ll /usr/local/cuda/bin/*cuda*
-rwxr-xr-x 1 root root 4833904 Nov 13  2019 /usr/local/cuda/bin/cudafe++*
-rwxr-xr-x 1 root root 8890248 Nov 13  2019 /usr/local/cuda/bin/cuda-gdb*
-rwxr-xr-x 1 root root  581744 Nov 13  2019 /usr/local/cuda/bin/cuda-gdbserver*
-rwxr-xr-x 1 root root     800 Nov 13  2019 /usr/local/cuda/bin/cuda-install-samples-10.2.sh*
-rwxr-xr-x 1 root root  397480 Nov 13  2019 /usr/local/cuda/bin/cuda-memcheck*

Thanks.

I don’t have any suggestions about CUDA 10.2. If it were me I would not bother uninstalling it.

The Linux install guide I linked discusses uninstall methods. For example, here.

If you don’t know how you installed CUDA, I don’t know either. The lack of an uninstaller script would typically mean to me that you used a package manager method to install. If you read the document, you’ll get a better idea of what I mean.

Thanks Robert. After installing 11.7 I was able to run the nvjpegDecoder example. I would like to ask a question about its performance, which does not look quite right to me (the average decode time per image seems too large).

./nvjpegDecoder -i ../input_images/ -o output/
Decoding images in directory: ../input_images/, total 12, batchsize 1
Processing: ../input_images/img6.jpg
Image is 3 channels.
Channel #0 size: 640 x 480
Channel #1 size: 320 x 240
Channel #2 size: 320 x 240
YUV 4:2:0 chroma subsampling
Done writing decoded image to file: output//img6.bmp
Processing: ../input_images/img4.jpg
Image is 3 channels.
Channel #0 size: 640 x 426
Channel #1 size: 320 x 213
Channel #2 size: 320 x 213
YUV 4:2:0 chroma subsampling
Done writing decoded image to file: output//img4.bmp
Processing: ../input_images/cat_grayscale.jpg
Image is 1 channels.
Channel #0 size: 64 x 64
Grayscale JPEG 
Done writing decoded image to file: output//cat_grayscale.bmp
Processing: ../input_images/img5.jpg
Image is 3 channels.
Channel #0 size: 640 x 480
Channel #1 size: 320 x 240
Channel #2 size: 320 x 240
YUV 4:2:0 chroma subsampling
Done writing decoded image to file: output//img5.bmp
Processing: ../input_images/cat_baseline.jpg
Image is 3 channels.
Channel #0 size: 64 x 64
Channel #1 size: 64 x 64
Channel #2 size: 64 x 64
YUV 4:4:4 chroma subsampling
Done writing decoded image to file: output//cat_baseline.bmp
Processing: ../input_images/img7.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Done writing decoded image to file: output//img7.bmp
Processing: ../input_images/cat.jpg
Image is 3 channels.
Channel #0 size: 64 x 64
Channel #1 size: 64 x 64
Channel #2 size: 64 x 64
YUV 4:4:4 chroma subsampling
Done writing decoded image to file: output//cat.bmp
Processing: ../input_images/img2.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Done writing decoded image to file: output//img2.bmp
Processing: ../input_images/img9.jpg
Image is 3 channels.
Channel #0 size: 640 x 480
Channel #1 size: 320 x 240
Channel #2 size: 320 x 240
YUV 4:2:0 chroma subsampling
Done writing decoded image to file: output//img9.bmp
Processing: ../input_images/img1.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Done writing decoded image to file: output//img1.bmp
Processing: ../input_images/img3.jpg
Image is 3 channels.
Channel #0 size: 640 x 426
Channel #1 size: 320 x 213
Channel #2 size: 320 x 213
YUV 4:2:0 chroma subsampling
Done writing decoded image to file: output//img3.bmp
Processing: ../input_images/img8.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Done writing decoded image to file: output//img8.bmp
Total decoding time: 7.9063
Avg decoding time per image: 0.658859
Avg images per sec: 1.51778
Avg decoding time per batch: 0.658859

Could it be because I am not using the right driver? nvidia-smi says 515.43.04:

nvidia-smi
Mon May 16 16:36:01 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.43.04    Driver Version: 515.43.04    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  On   | 00000000:61:00.0 Off |                    0 |
| N/A   74C    P0    90W / 300W |    413MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      4152      C   .../pcoip-agent/pcoip-server      411MiB |
+-----------------------------------------------------------------------------+

Thanks.

Just to be clear, I am trying to compare the JPEG decode performance with what has been published here.

Thanks.

The information published there, using the test code you are running, is as follows:

Avg decoding time per image: 1.23571
Avg images per sec: 0.809248
Avg decoding time per batch: 1.23571

The data you have provided from your run is:

Avg decoding time per image: 0.658859
Avg images per sec: 1.51778
Avg decoding time per batch: 0.658859

Your data looks better to me.

I imagine, to get maximum throughput, you would need to batch images. The published batch size there for the performance graphs is 128.
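
If you go that route, the batched entry points in nvJPEG look roughly like this (a minimal sketch; allocation of the output image buffers and all error handling are omitted):

#include <nvjpeg.h>
#include <cuda_runtime.h>

// Sketch: decode `batch` JPEG bitstreams in one batched call.
// data[i]/lengths[i] are host pointers to the compressed bitstreams;
// out[i] are nvjpegImage_t destinations preallocated in device memory.
void decode_batch(nvjpegHandle_t handle,
                  const unsigned char *const *data, const size_t *lengths,
                  nvjpegImage_t *out, int batch, cudaStream_t stream) {
    nvjpegJpegState_t state;
    nvjpegJpegStateCreate(handle, &state);

    // One-time setup per batch size / output format; the fourth argument
    // is the maximum number of CPU threads nvJPEG may use.
    nvjpegDecodeBatchedInitialize(handle, state, batch, 1, NVJPEG_OUTPUT_RGBI);

    // Decode the whole batch on the given stream.
    nvjpegDecodeBatched(handle, state, data, lengths, out, stream);
    cudaStreamSynchronize(stream);

    nvjpegJpegStateDestroy(state);
}

Batching lets nvJPEG overlap the CPU-side bitstream work with the GPU (or hardware engine) decode across images, which is presumably how the batch-128 numbers in the blog are achieved.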

Hi Robert,
We are very impressed with, and count on, the JPEG decoder hardware acceleration in the A100, as quoted in the blog (> 3000 fps for 1920x1080 at a batch size of 128). I am looking to reproduce this performance.

I ran decoding of 1920x1080 JPEG images with a batch size of 128, but do not get anywhere close to 3000 fps.

~/work/internet/nvidia/cudalibsamples/CUDALibrarySamples/nvJPEG/nvJPEG-Decoder/build$ ./nvjpegDecoder -i ../../nvJPEG-Decoder/demo_video/frames/ -b 128 >& junk-no-output
~/work/internet/nvidia/cudalibsamples/CUDALibrarySamples/nvJPEG/nvJPEG-Decoder/build$ ./nvjpegDecoder -i ../../nvJPEG-Decoder/demo_video/frames/ -b 128  -o output/ >& junk-output
~/work/internet/nvidia/cudalibsamples/CUDALibrarySamples/nvJPEG/nvJPEG-Decoder/build$ tail junk-no-output -n 5
YUV 4:2:0 chroma subsampling
Total decoding time: 175.67
Avg decoding time per image: 0.323518
Avg images per sec: 3.09102
Avg decoding time per batch: 35.1341
~/work/internet/nvidia/cudalibsamples/CUDALibrarySamples/nvJPEG/nvJPEG-Decoder/build$ tail junk-output -n 5
Done writing decoded image to file: output//output_00173.bmp
Total decoding time: 176.713
Avg decoding time per image: 0.325438
Avg images per sec: 3.07278
Avg decoding time per batch: 35.3425

Please help me get to the performance stated in the blog.

Thanks

Those are actually numbers in milliseconds, not seconds: 0.323518 ms per image is 1 / 0.000323518 s ≈ 3091 images per second (the “Avg images per sec” line is really images per millisecond). So you are getting about 3091 images/s, right in line with the blog.