Using nvds_obj_enc_process to save images is much slower than OpenCV

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU)
3080Ti
• DeepStream Version
6.0.1(docker nvcr.io/nvidia/deepstream:6.0-devel)
• JetPack Version (valid for Jetson only)
• TensorRT Version
TensorRT v8001
• NVIDIA GPU Driver Version (valid for GPU only)
470.103.01
• Issue Type( questions, new requirements, bugs)
bugs. The documentation says nvds_obj_enc_process is a non-blocking call. Because nvds_obj_enc_process is called in a pipeline callback, it affects pipeline performance and causes phenomena like "Nvds_obj_enc_process for whole frame leaves artifacts over time".

• How to reproduce the issue ? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing)
Use the default deepstream-transfer-learning-app and add timing code. With the default 1080p streammux config:

save_image takes 0.217953s
save_image takes 0.136698s
save_image takes 0.141414s
save_image takes 0.136969s

When upscaling to 4K in streammux:

save_image takes 0.871812s
save_image takes 0.450789s
save_image takes 0.474793s
save_image takes 0.475765s

Using OpenCV at 4K:

all_bbox_generated called! colorformat =6
save_image takes 0.054761s
all_bbox_generated called! colorformat =6
save_image takes 0.039139s
all_bbox_generated called! colorformat =6
save_image takes 0.037592s
all_bbox_generated called! colorformat =6
save_image takes 0.037836s
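
A rough way to compare the two 4K runs above is to drop the first sample of each run and average the rest. Treating the first call as warm-up is an assumption here (the post does not say why it is slower), and the helper name steady_state_mean is made up for illustration. A minimal sketch using only the numbers listed above:

```cpp
#include <numeric>
#include <vector>

// Timings in seconds, copied from the 4K runs on the 3080Ti listed above.
const std::vector<double> kEnc4k    = {0.871812, 0.450789, 0.474793, 0.475765};
const std::vector<double> kOpenCv4k = {0.054761, 0.039139, 0.037592, 0.037836};

// Mean of all samples except the first, which is treated as warm-up.
// Returns 0.0 when there are fewer than two samples.
double steady_state_mean(const std::vector<double>& samples) {
    if (samples.size() < 2) {
        return 0.0;
    }
    const double sum = std::accumulate(samples.begin() + 1, samples.end(), 0.0);
    return sum / static_cast<double>(samples.size() - 1);
}

// Ratio of the steady-state nvds_obj_enc_process time to the OpenCV time.
double enc_to_opencv_ratio() {
    return steady_state_mean(kEnc4k) / steady_state_mean(kOpenCv4k);
}
```

Calling enc_to_opencv_ratio() from a small main() and printing the result puts the 4K gap on this GPU at roughly an order of magnitude, which is the difference this report is about.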

The code I use:

static void save_images_opencv(NvBufSurface * surface, std::string filename) {
  // Only the first filled frame is saved (the loop breaks after one pass).
  for (uint frameIndex = 0; frameIndex < surface->numFilled;
      frameIndex++) {
    // Host staging buffer for the NV12 frame.
    void *src_data = malloc(surface->surfaceList[frameIndex].dataSize);
    if (src_data == NULL) {
        g_print("Error: failed to malloc src_data \n");
        return;
    }
    cudaMemcpy(src_data,
        surface->surfaceList[frameIndex].dataPtr,
        surface->surfaceList[frameIndex].dataSize,
        cudaMemcpyDeviceToHost);
    gint frame_width = (gint)surface->surfaceList[frameIndex].width;
    gint frame_height = (gint)surface->surfaceList[frameIndex].height;
    size_t frame_step = surface->surfaceList[frameIndex].pitch;
    printf("all_bbox_generated called! colorformat =%d\n", surface->surfaceList[frameIndex].colorFormat);
    // Wrap the NV12 data (Y plane, then interleaved UV) without copying.
    // Using the pitch as the row step assumes the UV plane starts right after the Y plane.
    cv::Mat frame = cv::Mat(frame_height * 3/2, frame_width, CV_8UC1, src_data, frame_step);
    cv::Mat out_mat;
    cv::cvtColor(frame, out_mat, cv::COLOR_YUV2BGR_NV12);

    cv::imwrite(filename, out_mat);
    free(src_data);  // The original version leaked this buffer on every call.
    break;
  }
  return;
}

static bool save_image(const std::string &path,
                       NvBufSurface *ip_surf, NvDsObjectMeta *obj_meta,
                       NvDsFrameMeta *frame_meta, unsigned &obj_counter) {
    NvDsObjEncUsrArgs userData = {0};
    if (path.size() >= sizeof(userData.fileNameImg)) {
        std::cerr << "Folder path too long (path: " << path
                  << ", size: " << path.size() << ") could not save image.\n"
                  << "Should be less than " << sizeof(userData.fileNameImg) << " characters.";
        return false;
    }
    userData.saveImg = TRUE;
    userData.attachUsrMeta = FALSE;
    path.copy(userData.fileNameImg, path.size());
    userData.fileNameImg[path.size()] = '\0';
    userData.objNum = obj_counter++;
    userData.quality = 80;

    g_img_meta_consumer.init_image_save_library_on_first_time();

    // steady_clock is monotonic, so it is the safer choice for measuring intervals.
    auto start = std::chrono::steady_clock::now();
    // nvds_obj_enc_process(g_img_meta_consumer.get_obj_ctx_handle(),
    //                      &userData, ip_surf, obj_meta, frame_meta);
    save_images_opencv(ip_surf, userData.fileNameImg);
    auto end = std::chrono::steady_clock::now();
    // Elapsed wall-clock time in seconds.
    std::chrono::duration<double> duration = end - start;
    std::cout << "save_image takes " << duration.count() << "s" << std::endl;
    return true;
}
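
To see which step dominates (the device-to-host copy, the color conversion, or the file write), each step could be timed separately instead of timing the whole call. The sketch below is one way to do that with a scope-based timer. ScopedStageTimer, StageStats, and fake_stage are hypothetical names, not part of DeepStream or the sample app, and the stage here only sleeps in place of real work.

```cpp
#include <chrono>
#include <map>
#include <string>
#include <thread>
#include <utility>

// Accumulated wall-clock time and call count for one named stage.
struct StageStats {
    double total_seconds = 0.0;
    int calls = 0;
};

// Hypothetical helper (not part of DeepStream): measures how long a scope
// takes and adds the result to the stats entry for the given stage name.
class ScopedStageTimer {
public:
    ScopedStageTimer(std::map<std::string, StageStats>& stats, std::string stage)
        : stats_(stats), stage_(std::move(stage)),
          start_(std::chrono::steady_clock::now()) {}

    ~ScopedStageTimer() {
        const std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start_;
        StageStats& s = stats_[stage_];
        s.total_seconds += elapsed.count();
        s.calls += 1;
    }

    ScopedStageTimer(const ScopedStageTimer&) = delete;
    ScopedStageTimer& operator=(const ScopedStageTimer&) = delete;

private:
    std::map<std::string, StageStats>& stats_;
    std::string stage_;
    std::chrono::steady_clock::time_point start_;
};

// Stand-in for one stage of the save path; it only sleeps.
void fake_stage(std::map<std::string, StageStats>& stats,
                const std::string& name, int millis) {
    ScopedStageTimer timer(stats, name);
    std::this_thread::sleep_for(std::chrono::milliseconds(millis));
}
```

In save_images_opencv, a timer of this kind would be placed in its own braces around the cudaMemcpy, cv::cvtColor, and cv::imwrite calls, and total_seconds / calls printed per stage after a few frames. This measures host wall-clock time only, so it suits blocking calls. For a call that merely queues work, it shows the submit cost rather than the completion time.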

• Requirement details( This is for new requirement. Including the module name-for which plugin or for which sample application, the function description)

Could you attach the timing code you added to the DeepStream demo?

The code is in the above post.

The save_image function is from the default deepstream-transfer-learning-app; I only added the timing code. save_images_opencv is my own code for saving images with OpenCV.

I tested this demo on a T4 server. Below is the result:

t4:
our code:
save_image takes 0.105787s
save_image takes 0.012514s
save_image takes 0.013048s
save_image takes 0.013138s

opencv:
save_image takes 0.054032s
save_image takes 0.029405s
save_image takes 0.024776s
save_image takes 0.024753s

It seems that the difference is not so big. Apart from the first frame, our code performs better. We’ll check the time consumption in our low-level code.

Thank you for your work. Could you please try setting the resolution to 4K in streammux? That is much slower in my test.

OK, on my board (T4), the difference does not seem very big at 4K either.

4K enc:
save_image takes 0.09789s
save_image takes 0.071233s
save_image takes 0.063071s
save_image takes 0.053526s
4K opencv:
save_image takes 0.450776s
save_image takes 0.065727s
save_image takes 0.065379s
save_image takes 0.070413s

But if it uses CUDA acceleration to encode JPEG on your GPU, it should be faster. We’ll check it.