Frame rate drops when saving jpg files in Deepstream 6.2 SDK

Do you mean the TRT engine file?
Is it related to the JPEG encode issue?

We want to simulate the same load as in your case.

This is our model file sample.engine (12.1 MB).

Our model file is an FP16 model that can predict two classes.

This application is available for you to test, and it also exhibits the frame rate drop problem.

The config file used by this application is “source1_usb_dec_infer_resnet_int8_for_nv.txt”, which is modified from the sample file “source1_usb_dec_infer_resnet_int8.txt”.

Please place the “source1_usb_dec_infer_resnet_int8_for_nv.txt” file in the same location as the sample file, and then run the application.

deepstream-app-for-nvidia-test.7z (164.7 KB)

By the way, the testing environment is the same as previously mentioned.


Hi @kpernos9 ,
We are checking now and need some time to get back to you, thank you.

I’ve added some nvds_obj_enc_process() performance measurement code in deepstream_app.c.
deepstream_app.c (58.2 KB)

With max power enabled on the Orin NX board (Performance — DeepStream 6.2 Release documentation) and the clocks maxed out (VPI - Vision Programming Interface: Performance Benchmark), the nvds_obj_enc_process() time for one frame is about 1.3 ms, so it will not impact the FPS too much. Can you try to measure the encoding time on your board?
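For reference, the measurement can be as simple as a gettimeofday() pair around the call, roughly like this (a minimal sketch, not necessarily identical to the attached deepstream_app.c; it assumes appCtx, frameData, ip_surf and frame_meta are already in scope in the probe):

  /* Sketch: time the nvds_obj_enc_process() call inside the probe.
   * appCtx, frameData, ip_surf and frame_meta are assumed to exist. */
  struct timeval t1, t2, tres;
  double ms;

  gettimeofday (&t1, NULL);
  nvds_obj_enc_process (appCtx->obj_ctx_handle_, &frameData, ip_surf, NULL, frame_meta);
  gettimeofday (&t2, NULL);

  timersub (&t2, &t1, &tres);
  ms = tres.tv_sec * 1000 + (1.0 * tres.tv_usec) / 1000;
  g_print ("nvds_obj_enc_process time %f ms\n", ms);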

@Fiona.Chen please tell us how many objects were in the feed during your tests, as I do not get anywhere close to that performance.

If you had fewer than 5 objects, please run the test again with a real-world scenario that has 30+ objects per feed.

Also, I’d like to point out that 1.3 ms per frame will definitely affect performance if you have more than one feed running, as the probe call will block until it has finished saving the frames from all feeds and cropping the objects for all feeds.

In this situation, say we have 8 feeds running at 15 fps and each full-frame encode takes around 1.3 ms: right off the bat we are blocking for roughly 10 ms, and then we have to crop/encode each object. If we have 30 objects per frame, which is perfectly reasonable, we need to add 0.5-1 ms to our total for each object. In this example, we are blocking for over 33 ms per frame!

That means the total probe blocking time to crop all frames and objects from the batch would be upwards of 240 ms. Even if you don’t save the full frame and just crop objects, you’re at 207 ms, which equates to 4.8 FPS when it should be running at 15 FPS.
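To spell out the arithmetic behind those ballpark figures (this is only a back-of-the-envelope estimate, assuming the ~1.3 ms per full-frame encode quoted above and ~1 ms per object crop as the upper bound, not a measurement):

/* Rough blocking-time estimate for one batch, using the figures quoted
 * above as assumptions. Not a measurement, just the arithmetic. */
#include <stdio.h>

int main (void)
{
  const int feeds = 8;
  const int objects_per_frame = 30;
  const double frame_ms = 1.3;   /* per full-frame encode */
  const double obj_ms = 1.0;     /* per object crop/encode, upper bound */

  double frames_total = feeds * frame_ms;                      /* ~10 ms  */
  double objects_total = feeds * objects_per_frame * obj_ms;   /* ~240 ms */

  printf ("frames: %.1f ms, objects: %.1f ms, batch total: %.1f ms\n",
          frames_total, objects_total, frames_total + objects_total);
  printf ("budget per batch at 15 fps: %.1f ms\n", 1000.0 / 15);
  return 0;
}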

Maybe there’s a better way to do this? But I am following your lead from your examples.

Here’s a short list of the times, in milliseconds, it takes to crop when using your method.

70.537
51.033
21.606
39.027
0.014
19.9
14.797
0.02
96.307
6.388
12.976
17.777
20.59
13.647
12.293
136.774
0.061
12.563
12.065
17.967
12.524
12.703
9.957
19.649
78.536
31.066
12.679
10.466
13.214
8.899
11.194
8.749
79.924
50.797
24.69
8.634
22.959
7.725
15.38
8.101
111.274
41.597
34.405
19.539
19.886
13.751
20.182
20.575
135.396
28.206
59.725
26.235

The output here is from 6 feeds (15 fps) on an AGX Xavier (MAXN, clocks maxed, etc.) with PeopleNet. In this run we are only cropping and encoding objects, not encoding the full frame, and there are just a few detections in each frame because it is the middle of the night.

During the day, the performance is absolutely atrocious.

In this example, whenever the probe takes longer than the batched-push-timeout on streammux, we lose FPS: downstream elements become starved while upstream elements fill up. So it evolves into a much larger problem than just losing 50 ms; suddenly we are dropping random buffers, losing data, and every element in the pipeline loses its ability to operate at the performance NVIDIA markets DeepStream at.
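For reference, the timeout I mean is nvstreammux’s batched-push-timeout. In deepstream-app it comes from the [streammux] section of the config file; in a hand-built pipeline it would be set roughly like this (an illustrative snippet, not code from the attached app):

  /* Illustrative only: batched-push-timeout (microseconds) is the window
   * the muxer waits to assemble a batch. A probe that blocks longer than
   * this per batch makes buffers queue upstream. */
  GstElement *streammux = gst_element_factory_make ("nvstreammux", "stream-muxer");
  g_object_set (G_OBJECT (streammux),
      "batch-size", 8,
      "width", 1920, "height", 1080,
      "batched-push-timeout", 66666,   /* ~1/15 s, matching 15 fps feeds */
      NULL);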

@rsc44 Are you talking about the same issue as @kpernos9? If not, please raise your own topic, thank you!

Previously, nvds_obj_enc_process() took around 2.2 ms for one frame; after I maxed out the clocks, the nvds_obj_enc_process() time for one frame is about 0.8 ms on my board.

But I found out that the issue lies with nvds_obj_enc_finish(), which takes up to 325 ms when saving one frame.

I added the following code to the program to measure the time.

  /* Measure how long nvds_obj_enc_finish() blocks for this batch. */
  struct timeval t3, t4, tresult2;
  double timeuse2;
  gettimeofday (&t3, NULL);

  nvds_obj_enc_finish (appCtx->obj_ctx_handle_);

  gettimeofday (&t4, NULL);
  timersub (&t4, &t3, &tresult2);
  timeuse2 = tresult2.tv_sec * 1000 + (1.0 * tresult2.tv_usec) / 1000;
  g_print ("JPG enc finish time %fms\n", timeuse2);
  

The output log segment is as follows.

JPG enc finish time 0.000000ms
JPG enc finish time 0.000000ms
JPG enc finish time 0.001000ms
JPG enc finish time 0.000000ms
JPG enc finish time 0.001000ms
JPG enc finish time 0.001000ms
JPG enc finish time 0.001000ms
JPG enc finish time 0.000000ms
JPG enc finish time 0.002000ms
JPG enc finish time 0.002000ms
JPG enc time 0.897000ms
JPG enc finish time 323.984000ms
JPG enc finish time 0.002000ms
JPG enc finish time 0.001000ms
JPG enc finish time 0.001000ms
JPG enc finish time 0.001000ms
JPG enc finish time 0.002000ms
JPG enc finish time 0.001000ms
JPG enc finish time 0.000000ms
JPG enc finish time 0.001000ms
JPG enc finish time 0.000000ms
JPG enc finish time 0.001000ms
JPG enc finish time 0.001000ms

Yes Fiona, the topic is that saving JPGs in the DeepStream pipeline causes performance issues if one uses nvds_obj_enc_process.

I decided to investigate the issue further, because you provided false information to @kpernos9 by saying that it wouldn’t cause any FPS drops.

Please move your timing function to after nvds_obj_enc_finish (appCtx->obj_ctx_handle_).

int img_count = 0;
static GstPadProbeReturn
gie_primary_processing_done_buf_prob (GstPad * pad, GstPadProbeInfo * info,
    gpointer u_data)
{
  GstBuffer *buf = (GstBuffer *) info->data;
  AppCtx *appCtx = (AppCtx *) u_data;
  NvDsBatchMeta *batch_meta = gst_buffer_get_nvds_batch_meta (buf);
  if (!batch_meta) {
    NVGSTDS_WARN_MSG_V ("Batch meta not found for buffer %p", buf);
    return GST_PAD_PROBE_OK;
  }

  write_kitti_output (appCtx, batch_meta);
  /* for image save */
  GstMapInfo inmap = GST_MAP_INFO_INIT;
  if (!gst_buffer_map (buf, &inmap, GST_MAP_READ)) {
    GST_ERROR ("input buffer mapinfo failed");
    return GST_PAD_PROBE_DROP;
  }
  NvBufSurface *ip_surf = (NvBufSurface *) inmap.data;
  gst_buffer_unmap (buf, &inmap);
  struct timeval t1,t2,tresult;
  double timeuse;
  gettimeofday(&t1,NULL);
  img_count++;
  char img_path[FILE_NAME_SIZE];
  strncpy(img_path, "./sample_img.jpg", sizeof(img_path) - 1);
  for (NvDsMetaList *l_frame = batch_meta->frame_meta_list; l_frame != NULL; l_frame = l_frame->next) {
    NvDsFrameMeta *frame_meta = (NvDsFrameMeta *) (l_frame->data);
    if ((img_count % 180) == 0) {        
      img_count = 0;
      NvDsObjectMeta *obj_meta = nvds_acquire_obj_meta_from_pool (batch_meta);
      obj_meta->rect_params.width = ip_surf->surfaceList[0].width;
      obj_meta->rect_params.height = ip_surf->surfaceList[0].height;
      
      obj_meta->rect_params.top = 0;
      obj_meta->rect_params.left = 0;

      NvDsObjEncUsrArgs frameData = {0};
      /* Preset */
      frameData.isFrame = 1;
      /* To be set by user */
      frameData.saveImg = TRUE;
      frameData.attachUsrMeta = TRUE;
      /* Set if Image scaling Required */
      frameData.scaleImg = FALSE;
      frameData.scaledWidth = 0;
      frameData.scaledHeight = 0;
      frameData.objNum = 0;
      snprintf(frameData.fileNameImg, FILE_NAME_SIZE, "%s", img_path);

      nvds_obj_enc_process(appCtx->obj_ctx_handle_, &frameData, ip_surf, NULL, frame_meta);

    }
  }
  nvds_obj_enc_finish (appCtx->obj_ctx_handle_);
  gettimeofday(&t2,NULL);
  timersub(&t2, &t1, &tresult);
  timeuse = tresult.tv_sec*1000 + (1.0 * tresult.tv_usec)/1000;
  g_print("JPG enc time %fms\n", timeuse);
  return GST_PAD_PROBE_OK;
}

nvds_obj_enc_process is a function call that simply pushes the surface and object meta to a queue. An underlying function inside your proprietary source code dequeues the buffer and object meta to do the cropping and encoding.

Because the API also attaches the JPEG to the metadata, the buffer is blocked from moving to the next element until the crop/encoding is finished for that single batch.

This is accomplished by the function nvds_obj_enc_finish(), which is conceptually just a loop waiting on the queue/futures to complete for that buffer.
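Conceptually, the model I’m describing looks something like the sketch below. To be clear, this is my mental model of the behaviour, not NVIDIA’s actual implementation:

/* Conceptual sketch only -- NOT NVIDIA's implementation. It just models
 * the behaviour described above: "process" enqueues a job for a worker
 * thread, and "finish" blocks until everything queued so far is done. */
#include <glib.h>

typedef struct {
  int frame_id;                 /* stand-in for the surface / object meta */
} EncJob;

static GAsyncQueue *job_queue;
static gint pending = 0;
static GMutex done_lock;
static GCond done_cond;

/* worker thread: dequeues jobs and "encodes" them */
static gpointer
enc_worker (gpointer data)
{
  EncJob *job;
  while ((job = g_async_queue_pop (job_queue)) != NULL) {
    if (job->frame_id < 0) {    /* poison pill: stop the worker */
      g_free (job);
      break;
    }
    g_usleep (1000);            /* pretend to crop + JPEG-encode */
    g_free (job);
    g_mutex_lock (&done_lock);
    if (g_atomic_int_dec_and_test (&pending))
      g_cond_signal (&done_cond);
    g_mutex_unlock (&done_lock);
  }
  return NULL;
}

/* analogous to nvds_obj_enc_process(): only queues the work */
static void
fake_enc_process (int frame_id)
{
  EncJob *job = g_new0 (EncJob, 1);
  job->frame_id = frame_id;
  g_atomic_int_inc (&pending);
  g_async_queue_push (job_queue, job);
}

/* analogous to nvds_obj_enc_finish(): blocks until the queue drains */
static void
fake_enc_finish (void)
{
  g_mutex_lock (&done_lock);
  while (g_atomic_int_get (&pending) > 0)
    g_cond_wait (&done_cond, &done_lock);
  g_mutex_unlock (&done_lock);
}

int
main (void)
{
  job_queue = g_async_queue_new ();
  GThread *worker = g_thread_new ("enc-worker", enc_worker, NULL);

  for (int i = 0; i < 8; i++)   /* one "process" call per frame/object */
    fake_enc_process (i);
  fake_enc_finish ();           /* the probe blocks here until all are encoded */

  EncJob *stop = g_new0 (EncJob, 1);
  stop->frame_id = -1;          /* poison pill to stop the worker */
  g_async_queue_push (job_queue, stop);
  g_thread_join (worker);
  g_async_queue_unref (job_queue);
  return 0;
}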

To follow up,

I spent the day testing, and it turns out the probe function that uses nvds_obj_enc_process is around 2x slower than using plain old appsink and OpenCV (built with CUDA) to do the encoding.
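For completeness, the shape of the appsink approach I compared against is roughly the following (a minimal sketch, not the exact code I ran: a videotestsrc stands in for the real DeepStream pipeline, and the JPEG encode, which in my test was done with OpenCV built with CUDA, is stubbed out in the callback):

#include <gst/gst.h>
#include <gst/app/gstappsink.h>

static GstFlowReturn
on_new_sample (GstAppSink *sink, gpointer user_data)
{
  GstSample *sample = gst_app_sink_pull_sample (sink);
  if (!sample)
    return GST_FLOW_ERROR;

  GstBuffer *buf = gst_sample_get_buffer (sample);
  GstMapInfo map;
  if (gst_buffer_map (buf, &map, GST_MAP_READ)) {
    /* hand map.data to the encoder of your choice here (cv::imencode,
     * nvJPEG, ...), ideally on a worker thread so this returns quickly */
    gst_buffer_unmap (buf, &map);
  }
  gst_sample_unref (sample);
  return GST_FLOW_OK;
}

int
main (int argc, char *argv[])
{
  gst_init (&argc, &argv);

  /* videotestsrc is a placeholder for the real camera/inference pipeline */
  GstElement *pipe = gst_parse_launch (
      "videotestsrc num-buffers=100 ! videoconvert ! "
      "appsink name=sink emit-signals=true", NULL);
  GstElement *sink = gst_bin_get_by_name (GST_BIN (pipe), "sink");
  g_signal_connect (sink, "new-sample", G_CALLBACK (on_new_sample), NULL);

  gst_element_set_state (pipe, GST_STATE_PLAYING);
  GstBus *bus = gst_element_get_bus (pipe);
  GstMessage *msg = gst_bus_timed_pop_filtered (bus, GST_CLOCK_TIME_NONE,
      GST_MESSAGE_EOS | GST_MESSAGE_ERROR);
  if (msg)
    gst_message_unref (msg);

  gst_element_set_state (pipe, GST_STATE_NULL);
  gst_object_unref (bus);
  gst_object_unref (sink);
  gst_object_unref (pipe);
  return 0;
}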

This is very odd; it seems that 6.0 is also affected, not just 6.2.

@Fiona.Chen if nvds_obj_enc_process is designed to run this slowly, might it be better to stop recommending that people use it in a probe on nvinfer?

P.S. It shouldn’t take a month to answer these questions.

@kpernos9
It seems nvds_obj_enc_finish() is the bottleneck; we are investigating the root cause now.

@Fiona.Chen
Hi, any update or progress? Still waiting for your response.

The bug is fixed. We are testing the patch.

What does your patch change? DeepStream, the BSP, or something else?

The fix changes two parts: one is the JPEG driver in the BSP, the other is the JPEG library in DeepStream.

Hi Fiona,

could you please be more precise, especially about:

  • What version of jetpack / l4t / deepstream will include this fix?
  • When will the bugfix be released?

The patch will be included in the next release.

