Nvmimg_2d performance issue

Software Version
[+] DRIVE OS Linux 5.2.0
DRIVE OS Linux 5.2.0 and DriveWorks 3.5
NVIDIA DRIVE™ Software 10.0 (Linux)
NVIDIA DRIVE™ Software 9.0 (Linux)
other DRIVE OS version
other

Target Operating System
[+] Linux
QNX
other

Hardware Platform
[+] NVIDIA DRIVE™ AGX Xavier DevKit (E3550)
NVIDIA DRIVE™ AGX Pegasus DevKit (E3550)
other

SDK Manager Version
1.5.0.7774
other

Host Machine Version
[+] native Ubuntu 18.04
other

Hi Guys,

I am using nvmimg_2d to convert YUV422 to YUV420. The processing time per frame is up to 1.66 ms.
Since this conversion is done by a hardware engine, it takes much longer than we expected.

$nvmimg_2d -cf sample_yuv422_to_yu420.cfg -v 3
nvmedia: createSurface: NvMediaImageCreate:: Image size: 1280x720 Image type: 42
nvmedia: createSurface: NvMediaImageCreate:: Image size: 1280x720 Image type: 25
nvmedia: WriteImage : Saving output image into file...
nvmedia: Processing time per frame 1.6600 ms 
nvmedia: destroySurface: Destroying surfaces
nvmedia: destroySurface: Destroying surfaces

The attachment is our sample_yuv422_to_yu420.cfg file.
sample_yuv422_to_yu420.cfg (5.8 KB)

For comparison, we implemented the YUV422-to-YUV420 conversion as a plain CPU loop. YUV422To420/1920/1080 costs about 0.372 ms. It doesn’t make sense that the hardware engine conversion is slower than the CPU method.

Can you help me out?

Dear @Peter_Pertrili,
Could you share the input video file and the CPU function so that I can reproduce this on my machine?

@SivaRamaKrishnaNV

I have attached our CPU function below for your reference.

As for the input video file, I tried to upload it, but the upload failed and I don't know why.
In any case, the content of the input file is irrelevant to the performance issue. You can use any dummy file as the input; just name it old_town_yuv422.yuv (assuming its size is more than 1.8 MB).
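For reference, the 1.8 MB figure follows from the frame geometry. The sketch below is only an illustration and assumes 8-bit planar data at the 1280x720 size shown in the logs; the helper names are hypothetical and are not part of the nvmimg_2d sample.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes in one 8-bit planar 4:2:2 frame:
 * a full-size Y plane plus two half-width, full-height chroma planes. */
size_t yuv422p_frame_bytes(int width, int height) {
    assert(width > 0 && height > 0 && width % 2 == 0);
    return (size_t)width * (size_t)height * 2;
}

/* Bytes in one 8-bit planar 4:2:0 frame:
 * a full-size Y plane plus two half-width, half-height chroma planes. */
size_t yuv420p_frame_bytes(int width, int height) {
    assert(width > 0 && height > 0 && width % 2 == 0 && height % 2 == 0);
    return (size_t)width * (size_t)height * 3 / 2;
}
```

For 1280x720 the first helper gives 1843200 bytes, which is the minimum input file size for one frame; the converted 4:2:0 frame needs 1382400 bytes.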

void yuv422pTo420pLoop(uint8_t* src, uint8_t* dst, int width, int height) {
    // Copy the Y plane unchanged
    for (int row = 0; row < height; row++) {
        for (int col = 0; col < width; col++) {
            *dst++ = *src++;
        }
    }

    // Copy the U and V planes, keeping every other chroma row
    for (int row = 0; row < height; row++) {
        for (int col = 0; col < width / 2; col++) {
            *dst++ = *src++;
        }
        src += width / 2;  // skip the next chroma row
    }
}
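To make the memory layout that this loop relies on explicit, here is a sketch of the same conversion written with one memcpy per row. It assumes planar input (Y plane, then U, then V), 8 bits per sample, and even width and height; the function name yuv422p_to_420p_rows is hypothetical and it is not claimed to be faster than the loop above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Planar 4:2:2 -> planar 4:2:0 by dropping every other chroma row.
 * Input layout assumed: Y (w*h), U (w/2 * h), V (w/2 * h). */
void yuv422p_to_420p_rows(const uint8_t *src, uint8_t *dst, int width, int height) {
    assert(width > 0 && height > 0 && width % 2 == 0 && height % 2 == 0);

    const size_t y_bytes = (size_t)width * (size_t)height;
    const size_t chroma_row = (size_t)width / 2;

    /* The Y plane is identical in both formats. */
    memcpy(dst, src, y_bytes);
    src += y_bytes;
    dst += y_bytes;

    /* U and V together hold 2*height chroma rows. Each iteration keeps
     * one row and skips the next, so height iterations cover both planes. */
    for (int row = 0; row < height; row++) {
        memcpy(dst, src, chroma_row);
        dst += chroma_row;
        src += 2 * chroma_row;
    }
}
```

On a 4x2 test frame this keeps the 8 luma bytes, the first U row and the first V row, for 12 output bytes in total.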

Dear @Peter_Pertrili,
You may share a link to a shared drive. I want to use the same file so that the results are the same and comparable.

@SivaRamaKrishnaNV

OK. Please download the file from this link:
https://media.xiph.org/video/derf/y4m/old_town_cross_422_720p50.y4m

After the download completes, please rename the file to old_town_yuv422.yuv.

I have also updated the CPU loop function in my previous reply.

Dear @Peter_Pertrili ,
It would be great if you could attach your complete CPU timing measurement code as well, so that we can compile and run it locally. That way we can also verify the timing mechanism.

@SivaRamaKrishnaNV

I have merged my CPU loop function into the img_2d source code.
Please apply the patch below on your side.

As you can see, the patch mainly includes the following two changes.

  1. Force compiler optimization with -O3
  2. Read the original file into src_img → call yuv422pTo420pLoop and store the result in dst_img
diff --git a/nvmedia/img_2d/Makefile b/nvmedia/img_2d/Makefile
index 5452889..c1e7b9f 100644
--- a/nvmedia/img_2d/Makefile
+++ b/nvmedia/img_2d/Makefile
@@ -10,7 +10,7 @@ include ../../../make/nvdefs.mk
 
 TARGETS = nvmimg_2d
 
-CFLAGS   := $(NV_PLATFORM_OPT) $(NV_PLATFORM_CFLAGS) -I. -I../utils
+CFLAGS   := $(NV_PLATFORM_OPT) $(NV_PLATFORM_CFLAGS) -O3  -I. -I../utils
 CFLAGS   += -DNVMEDIA_NVSCI_ENABLE
 CPPFLAGS := $(NV_PLATFORM_SDK_INC) $(NV_PLATFORM_CPPFLAGS)
 LDFLAGS  := $(NV_PLATFORM_SDK_LIB) $(NV_PLATFORM_LDFLAGS)
diff --git a/nvmedia/img_2d/image2d.c b/nvmedia/img_2d/image2d.c
index f060a7f..1a9970a 100644
--- a/nvmedia/img_2d/image2d.c
+++ b/nvmedia/img_2d/image2d.c
@@ -64,6 +64,23 @@ destroySurface(NvMediaImage *image)
     NvMediaImageDestroy(image);
 }
 
+void yuv422pTo420pLoop(uint8_t* src, uint8_t* dst, int width, int height) {
+   //Copy Y part
+   for (int row = 0; row < height; row++) {
+     for (int col = 0; col < width; col++) {
+         *dst++=*src++;
+     }
+   }
+
+   //Copy UV part
+   for (int row = 0; row < height; row++) {
+       for (int col = 0; col < width / 2; col++) {
+           *dst++=*src++;
+       }
+       src += width / 2;
+   }
+}
+
 static NvMediaStatus
 blit2DImage(Blit2DTest *ctx, TestArgs* args)
 {
@@ -71,7 +88,29 @@ blit2DImage(Blit2DTest *ctx, TestArgs* args)
     NvMediaImageSurfaceMap surfaceMap;
     uint64_t startTime,endTime;
     uint64_t end1Time;
-    double processingTime;
+    double processingTime = 0;
+    uint8_t * src_img;
+    uint8_t * dst_img;
+    uint32_t size = args->srcSurfAllocAttrs[0].value * args->srcSurfAllocAttrs[1].value * 2;
+
+    FILE * file_ptr = NULL;
+    file_ptr = fopen(args->inputFileName, "rb");
+    src_img = malloc(size);
+    dst_img = malloc(size);
+    if(fread(src_img, size, 1, file_ptr) != 1) {
+        LOG_ERR("%s: Error reading file: %s\n", __func__, args->inputFileName);
+    }
+    GetTimeMicroSec(&startTime);
+    yuv422pTo420pLoop(src_img, dst_img, args->srcSurfAllocAttrs[0].value, args->srcSurfAllocAttrs[1].value);
+    GetTimeMicroSec(&endTime);
+    processingTime = 0;
+    processingTime += (double)(endTime - startTime)/1000.0;
+
+    LOG_INFO("Current Allocate size is %d(%d*%d)\n", size, args->srcSurfAllocAttrs[0].value, args->srcSurfAllocAttrs[1].value);
+    LOG_INFO("CPU Loop Processing time per frame %.4f ms \n", processingTime);
+    fclose(file_ptr);
+    free(src_img);
+    free(dst_img);
 
     processingTime = 0;
     status = ReadImage(args->inputFileName,                         /* fileName */

The default optimization level of nvmimg_2d is -O2.

  • Performance result with -O2 option
nvmedia: createSurface: NvMediaImageCreate:: Image size: 1280x720 Image type: 42
nvmedia: Current Allocate size is 1843200(1280*720)
nvmedia: CPU Loop Processing time per frame 1.3570 ms 
nvmedia: WriteImage : Saving output image into file...
nvmedia: Processing time per frame 1.1920 ms 

  • Performance result with -O3 option
nvmedia: createSurface: NvMediaImageCreate:: Image size: 1280x720 Image type: 42
nvmedia: createSurface: NvMediaImageCreate:: Image size: 1280x720 Image type: 25
nvmedia: Current Allocate size is 1843200(1280*720)
nvmedia: CPU Loop Processing time per frame 0.7680 ms 
nvmedia: WriteImage : Saving output image into file...
nvmedia: Processing time per frame  1.3410 ms 

So the NVIDIA hardware engine does not seem to run as fast as we expected (as an FPGA would). Could it be a synchronization issue between the hardware engine and the CPU?
For now, with -O3, the hardware result is even slower than the CPU method.

Is there any update? @SivaRamaKrishnaNV

Is it possible to profile NvMedia2DBlitEx for hot spots? Then we could locate the bottleneck.

Dear @Peter_Pertrili,
I could reproduce the same timings with the nvmimg_2d sample. Note that the VIC engine may not be optimal for every operation at every input size. For a few operations the CPU could be faster, especially with multithreading. Also, it is generally recommended to time over a loop and average the result. I will check internally on this issue and update you if any WAR (workaround) can be provided.
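The advice to time over a loop and average could be sketched roughly as follows. This is only an illustration, assuming a POSIX system with clock_gettime; the names frame_fn, now_us and average_frame_ms are hypothetical and not part of the nvmimg_2d sample, and the conversion function is reproduced from the thread so the sketch compiles on its own.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Signature shared by the functions being timed (hypothetical typedef). */
typedef void (*frame_fn)(uint8_t *src, uint8_t *dst, int width, int height);

/* CPU conversion from the thread: planar 4:2:2 -> planar 4:2:0. */
void yuv422pTo420pLoop(uint8_t *src, uint8_t *dst, int width, int height) {
    for (int i = 0; i < width * height; i++) {
        *dst++ = *src++;                 /* Y plane */
    }
    for (int row = 0; row < height; row++) {
        for (int col = 0; col < width / 2; col++) {
            *dst++ = *src++;             /* keep one chroma row */
        }
        src += width / 2;                /* skip the next chroma row */
    }
}

/* Monotonic timestamp in microseconds. */
static uint64_t now_us(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000u + (uint64_t)ts.tv_nsec / 1000u;
}

/* Run fn a few times untimed to warm caches, then return the mean
 * time per call in milliseconds over iters timed calls. */
double average_frame_ms(frame_fn fn, uint8_t *src, uint8_t *dst,
                        int width, int height, int warmup, int iters) {
    assert(fn != NULL && warmup >= 0 && iters > 0);
    for (int i = 0; i < warmup; i++) {
        fn(src, dst, width, height);
    }
    uint64_t start = now_us();
    for (int i = 0; i < iters; i++) {
        fn(src, dst, width, height);
    }
    uint64_t end = now_us();
    return (double)(end - start) / 1000.0 / (double)iters;
}
```

A call such as average_frame_ms(yuv422pTo420pLoop, src_img, dst_img, 1280, 720, 10, 1000) would give a steadier CPU figure than a single run. The same warm-up-then-average pattern could presumably be placed around the blit section of blit2DImage for the hardware path. At -O3 it is worth checking that the compiler has not optimized away repeated calls whose output is never read.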

@SivaRamaKrishnaNV
If there is any update, please let me know.
Thanks for your kind help.