I have a question about the latency between streammux and nvinfer.

First, here is the diagram of my entire pipeline.

It is a pipeline that runs single-model inference on four RTSP streams.

Then I ran the performance testing tool from the FAQ; here is the data for a couple of batches:

************BATCH-NUM = 293**************
Comp name = nvv4l2decoder1 in_system_timestamp = 1762826702324.726074 out_system_timestamp = 1762826702326.193115               component latency= 1.467041
Comp name = nvstreammux-streammux_0 source_id = 2 pad_index = 2 frame_num = 261               in_system_timestamp = 1762826702326.372070 out_system_timestamp = 1762826702331.375977               component_latency = 5.003906
Comp name = nvv4l2decoder2 in_system_timestamp = 1762826702324.972900 out_system_timestamp = 1762826702327.766113               component latency= 2.793213
Comp name = nvstreammux-streammux_0 source_id = 0 pad_index = 0 frame_num = 261               in_system_timestamp = 1762826702327.927002 out_system_timestamp = 1762826702331.376953               component_latency = 3.449951
Comp name = nvv4l2decoder0 in_system_timestamp = 1762826702324.971924 out_system_timestamp = 1762826702329.519043               component latency= 4.547119
Comp name = nvstreammux-streammux_0 source_id = 3 pad_index = 3 frame_num = 263               in_system_timestamp = 1762826702329.746094 out_system_timestamp = 1762826702331.376953               component_latency = 1.630859
Comp name = nvv4l2decoder3 in_system_timestamp = 1762826702324.989014 out_system_timestamp = 1762826702331.116943               component latency= 6.127930
Comp name = nvstreammux-streammux_0 source_id = 1 pad_index = 1 frame_num = 259               in_system_timestamp = 1762826702331.256104 out_system_timestamp = 1762826702331.376953               component_latency = 0.120850
Comp name = nvinfer0 in_system_timestamp = 1762826702571.242920 out_system_timestamp = 1762826702679.404053               component latency= 108.161133
Comp name = nvvideoconvert0 in_system_timestamp = 1762826702679.553955 out_system_timestamp = 1762826702706.464111               component latency= 26.910156
Source id = 2 Frame_num = 261 Frame latency = 383.421875 (ms) 
Source id = 0 Frame_num = 261 Frame latency = 383.175049 (ms) 
Source id = 3 Frame_num = 263 Frame latency = 383.176025 (ms) 
Source id = 1 Frame_num = 259 Frame latency = 383.158936 (ms) 

************BATCH-NUM = 294**************
Comp name = nvv4l2decoder1 in_system_timestamp = 1762826702369.533936 out_system_timestamp = 1762826702371.176025               component latency= 1.642090
Comp name = nvstreammux-streammux_0 source_id = 2 pad_index = 2 frame_num = 262               in_system_timestamp = 1762826702371.396973 out_system_timestamp = 1762826702376.617920               component_latency = 5.220947
Comp name = nvv4l2decoder3 in_system_timestamp = 1762826702369.474121 out_system_timestamp = 1762826702372.829102               component latency= 3.354980
Comp name = nvstreammux-streammux_0 source_id = 1 pad_index = 1 frame_num = 260               in_system_timestamp = 1762826702373.000977 out_system_timestamp = 1762826702376.618896               component_latency = 3.617920
Comp name = nvv4l2decoder0 in_system_timestamp = 1762826702369.617920 out_system_timestamp = 1762826702374.531006               component latency= 4.913086
Comp name = nvstreammux-streammux_0 source_id = 3 pad_index = 3 frame_num = 264               in_system_timestamp = 1762826702374.667969 out_system_timestamp = 1762826702376.618896               component_latency = 1.950928
Comp name = nvv4l2decoder2 in_system_timestamp = 1762826702369.607910 out_system_timestamp = 1762826702376.277100               component latency= 6.669189
Comp name = nvstreammux-streammux_0 source_id = 0 pad_index = 0 frame_num = 262               in_system_timestamp = 1762826702376.479004 out_system_timestamp = 1762826702376.618896               component_latency = 0.139893
Comp name = nvinfer0 in_system_timestamp = 1762826702621.205078 out_system_timestamp = 1762826702726.452881               component latency= 105.247803
Comp name = nvvideoconvert0 in_system_timestamp = 1762826702726.670898 out_system_timestamp = 1762826702752.854980               component latency= 26.184082
Source id = 2 Frame_num = 262 Frame latency = 384.465088 (ms) 
Source id = 1 Frame_num = 260 Frame latency = 384.524902 (ms) 
Source id = 3 Frame_num = 264 Frame latency = 384.381104 (ms) 
Source id = 0 Frame_num = 262 Frame latency = 384.391113 (ms) 

When I looked at batch 293, I found a time difference of over 200 ms between the out_system_timestamp of nvstreammux-streammux_0 and the in_system_timestamp of nvinfer0. What is going on here? How can I reduce the delay in this part?
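
To make the gap concrete, here is the arithmetic using the timestamps from the BATCH-NUM = 293 log above (system timestamps are in milliseconds; the variable names below are just placeholders):

#include <cstdio>

int main()
{
    /* Values copied from the BATCH-NUM = 293 log above (system timestamps in ms). */
    const double mux_out_ts  = 1762826702331.376953; /* nvstreammux-streammux_0 out_system_timestamp */
    const double infer_in_ts = 1762826702571.242920; /* nvinfer0 in_system_timestamp */

    /* Time the batch spent between leaving streammux and entering nvinfer. */
    std::printf("streammux -> nvinfer gap: %.3f ms\n", infer_in_ts - mux_out_ts); /* ~239.87 ms */
    return 0;
}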

Here are some properties of the components (a small sketch of how they could be applied in code follows the list):

nvurisrcbin:
  type: 0
  gpu-id: 0
  cudadec-memtype: 0
  num-extra-surfaces: 2
  dec-skip-frames: 0
  drop-frame-interval: 0
  select-rtp-protocol: 0
  latency: 5000
  udp-buffer-size: 0
  rtsp-reconnect-interval: 10
nvstreammux:
  batched-push-timeout: 40000
  buffer-pool-size: 16
  width: 1920
  height: 1080
  compute-hw: 0
  interpolation-method: 1
  enable-padding: false
  nvbuf-memory-type: 0
  num-surfaces-per-frame: 1
  sync-inputs: false 
  attach-sys-ts: true
  async-process: true
  live-source: true
  max-latency: 0
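
For reference, a minimal sketch (not my actual pipeline code; the element variable names are placeholders) of how these properties could be applied with g_object_set():

#include <gst/gst.h>

/* Apply the nvstreammux settings listed above to an already-created element. */
static void configure_streammux(GstElement *streammux)
{
    g_object_set(G_OBJECT(streammux),
                 "width", 1920,
                 "height", 1080,
                 "batched-push-timeout", 40000, /* microseconds */
                 "buffer-pool-size", 16,
                 "live-source", TRUE,
                 "sync-inputs", FALSE,
                 "attach-sys-ts", TRUE,
                 NULL);
}

/* The nvurisrcbin properties follow the same pattern, e.g. the RTSP settings. */
static void configure_source(GstElement *srcbin)
{
    g_object_set(G_OBJECT(srcbin),
                 "latency", 5000,               /* value from the list above */
                 "rtsp-reconnect-interval", 10, /* value from the list above */
                 NULL);
}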

The buffer may be stored in the queue temporarily if the downstream does not consume the buffers in time.

@Fiona.Chen
What does “downstream” refer to here? I have checked the source code of nvinfer and found that the queueInputBatch and dequeueOutputBatch functions in nvdsinfer_context_impl.cpp execute very quickly, but I don’t know where the time is spent:

 NvDsInferStatus NvDsInferContextImpl::queueInputBatch(NvDsInferContextBatchInput &batchInput)
    {
        cudaEvent_t preprocStart, preprocEnd, inferStart, inferEnd, copyStart, copyEnd;
        cudaEventCreate(&preprocStart);
        cudaEventCreate(&preprocEnd);
        cudaEventCreate(&inferStart);
        cudaEventCreate(&inferEnd);
        cudaEventCreate(&copyStart);
        cudaEventCreate(&copyEnd);

        assert(m_Initialized);
        uint32_t batchSize = batchInput.numInputFrames;

        /* Check that current batch size does not exceed max batch size. */
        if (batchSize > m_MaxBatchSize)
        {
            printError("Not inferring on batch since it's size(%d) exceeds max batch size(%d)", batchSize, m_MaxBatchSize);
            return NVDSINFER_INVALID_PARAMS;
        }

        /* Set the cuda device to be used. */
        RETURN_CUDA_ERR(cudaSetDevice(m_GpuID), "queue buffer failed to set cuda device(%s)", m_GpuID);

        std::shared_ptr<CudaEvent> preprocWaitEvent = m_InputConsumedEvent;

        assert(m_Preprocessor && m_InputConsumedEvent);

        cudaEventRecord(preprocStart, *m_InferStream);
        RETURN_NVINFER_ERROR(m_Preprocessor->transform(batchInput, m_BindingBuffers[INPUT_LAYER_INDEX], *m_InferStream, preprocWaitEvent.get()), "Preproc trans input data failed.");
        cudaEventRecord(preprocEnd, *m_InferStream);

        auto recyleFunc = [this](NvDsInferBatch *batch)
        {
            if (batch)
                m_FreeBatchQueue.push(batch);
        };
        std::unique_ptr<NvDsInferBatch, decltype(recyleFunc)> safeRecyleBatch(m_FreeBatchQueue.pop(), recyleFunc);
        assert(safeRecyleBatch);
        safeRecyleBatch->m_BatchSize = batchSize;

        /* Fill the array of binding buffers for the current batch. */
        std::vector<void *> &bindings = safeRecyleBatch->m_DeviceBuffers;
        auto backendBuffer = std::make_shared<BackendBatchBuffer>(bindings, m_AllLayerInfo, batchSize);
        assert(m_BackendContext && backendBuffer);
        assert(m_InferStream && m_InputConsumedEvent && m_InferCompleteEvent);

        cudaEventRecord(inferStart, *m_InferStream);
        RETURN_NVINFER_ERROR(m_BackendContext->enqueueBuffer(backendBuffer, *m_InferStream, m_InputConsumedEvent.get()), "Infer context enqueue buffer failed");
        cudaEventRecord(inferEnd, *m_InferStream);

        /* Record event on m_InferStream to indicate completion of inference on the
         * current batch. */
        RETURN_CUDA_ERR(cudaEventRecord(*m_InferCompleteEvent, *m_InferStream), "Failed to record cuda infer-complete-event ");

        assert(m_PostprocessStream && m_InferCompleteEvent);
        /* Make future jobs on the postprocessing stream wait on the infer
         * completion event. */
        RETURN_CUDA_ERR(cudaStreamWaitEvent(*m_PostprocessStream, *m_InferCompleteEvent, 0), "postprocessing cuda waiting event failed ");

        /* Note: copyStart/copyEnd are recorded on m_InferStream, while the copy
         * itself is issued on m_PostprocessStream. */
        cudaEventRecord(copyStart, *m_InferStream);
        RETURN_NVINFER_ERROR(m_Postprocessor->copyBuffersToHostMemory(*safeRecyleBatch, *m_PostprocessStream), "post cuda process failed.");
        cudaEventRecord(copyEnd, *m_InferStream);
        cudaEventSynchronize(copyEnd);

        float preprocTime, inferTime, copyTime;
        cudaEventElapsedTime(&preprocTime, preprocStart, preprocEnd);
        cudaEventElapsedTime(&inferTime, inferStart, inferEnd);
        cudaEventElapsedTime(&copyTime, copyStart, copyEnd);

        printf("  ========================================\n  ");
        printf("  1. GPU Preproc:  %.2f ms\n", preprocTime);
        printf("  2. GPU Infer:    %.2f ms\n", inferTime);
        printf("  3. GPU→CPU copy: %.2f ms\n", copyTime);

        m_ProcessBatchQueue.push(safeRecyleBatch.release());
        return NVDSINFER_SUCCESS;
    }

Output:

  ========================================
    1. GPU Preproc:  0.98 ms
  2. GPU Infer:    20.52 ms
  3. GPU→CPU copy: 0.00 ms
  ========================================
    1. GPU Preproc:  1.00 ms
  2. GPU Infer:    20.59 ms
  3. GPU→CPU copy: 0.00 ms
  ========================================
    1. GPU Preproc:  0.70 ms
  2. GPU Infer:    14.23 ms
  3. GPU→CPU copy: 0.00 ms
  ========================================
    1. GPU Preproc:  1.05 ms
  2. GPU Infer:    20.59 ms
  3. GPU→CPU copy: 0.00 ms
  ========================================
    1. GPU Preproc:  1.01 ms
  2. GPU Infer:    20.44 ms
  3. GPU→CPU copy: 0.00 ms
  ========================================
    1. GPU Preproc:  0.69 ms
  2. GPU Infer:    14.17 ms
  3. GPU→CPU copy: 0.00 ms
  ========================================
    1. GPU Preproc:  1.04 ms
  2. GPU Infer:    20.47 ms
  3. GPU→CPU copy: 0.00 ms
  ========================================
    1. GPU Preproc:  0.93 ms
  2. GPU Infer:    20.46 ms
  3. GPU→CPU copy: 0.00 ms
  ========================================
    1. GPU Preproc:  0.65 ms
  2. GPU Infer:    14.16 ms
  3. GPU→CPU copy: 0.00 ms
  ========================================
    1. GPU Preproc:  1.02 ms
  2. GPU Infer:    20.63 ms
  3. GPU→CPU copy: 0.00 ms
  ========================================
    1. GPU Preproc:  0.93 ms
  2. GPU Infer:    20.44 ms
  3. GPU→CPU copy: 0.00 ms
  ========================================
    1. GPU Preproc:  0.68 ms
  2. GPU Infer:    14.20 ms
  3. GPU→CPU copy: 0.00 ms
  ========================================
    1. GPU Preproc:  0.64 ms
  2. GPU Infer:    14.18 ms
  3. GPU→CPU copy: 0.00 ms
  ========================================
************BATCH-NUM = 22**************
Comp name = nvv4l2decoder0 in_system_timestamp = 1762845268305.341064 out_system_timestamp = 1762845268333.204102               component latency= 27.863037
Comp name = nvstreammux-stream-muxer source_id = 1 pad_index = 1 frame_num = 15               in_system_timestamp = 1762845268333.305908 out_system_timestamp = 1762845268356.762939               component_latency = 23.457031
Comp name = nvv4l2decoder3 in_system_timestamp = 1762845268317.388916 out_system_timestamp = 1762845268334.867920               component latency= 17.479004
Comp name = nvstreammux-stream-muxer source_id = 0 pad_index = 0 frame_num = 15               in_system_timestamp = 1762845268334.940918 out_system_timestamp = 1762845268356.762939               component_latency = 21.822021
Comp name = nvv4l2decoder1 in_system_timestamp = 1762845268297.022949 out_system_timestamp = 1762845268336.510010               component latency= 39.487061
Comp name = nvstreammux-stream-muxer source_id = 2 pad_index = 2 frame_num = 15               in_system_timestamp = 1762845268336.575928 out_system_timestamp = 1762845268356.762939               component_latency = 20.187012
Comp name = nvv4l2decoder2 in_system_timestamp = 1762845268347.959961 out_system_timestamp = 1762845268348.760986               component latency= 0.801025
Comp name = nvstreammux-stream-muxer source_id = 3 pad_index = 3 frame_num = 21               in_system_timestamp = 1762845268348.855957 out_system_timestamp = 1762845268356.762939               component_latency = 7.906982
Comp name = nvinfer0 in_system_timestamp = 1762845268394.969971 out_system_timestamp = 1762845268614.472900               component latency= 219.502930
Source id = 1 Frame_num = 15 Frame latency = 309.230957 (ms)
Source id = 0 Frame_num = 15 Frame latency = 297.183105 (ms)
Source id = 2 Frame_num = 15 Frame latency = 317.549072 (ms)
Source id = 3 Frame_num = 21 Frame latency = 266.612061 (ms)
GPU Preproc:  1.34 ms
GPU Infer:    35.55 ms
GPU→CPU copy: 0.00 ms

************BATCH-NUM = 23**************
Comp name = nvv4l2decoder2 in_system_timestamp = 1762845268388.331055 out_system_timestamp = 1762845268389.250977               component latency= 0.919922
Comp name = nvstreammux-stream-muxer source_id = 3 pad_index = 3 frame_num = 22               in_system_timestamp = 1762845268389.343994 out_system_timestamp = 1762845268404.947021               component_latency = 15.603027
Comp name = nvv4l2decoder0 in_system_timestamp = 1762845268382.289062 out_system_timestamp = 1762845268401.531006               component latency= 19.241943
Comp name = nvstreammux-stream-muxer source_id = 1 pad_index = 1 frame_num = 16               in_system_timestamp = 1762845268401.729004 out_system_timestamp = 1762845268404.947021               component_latency = 3.218018
Comp name = nvv4l2decoder1 in_system_timestamp = 1762845268387.968018 out_system_timestamp = 1762845268403.187012               component latency= 15.218994
Comp name = nvstreammux-stream-muxer source_id = 2 pad_index = 2 frame_num = 16               in_system_timestamp = 1762845268403.272949 out_system_timestamp = 1762845268404.947021               component_latency = 1.674072
Comp name = nvv4l2decoder3 in_system_timestamp = 1762845268397.542969 out_system_timestamp = 1762845268404.819092               component latency= 7.276123
Comp name = nvstreammux-stream-muxer source_id = 0 pad_index = 0 frame_num = 16               in_system_timestamp = 1762845268404.897949 out_system_timestamp = 1762845268404.947021               component_latency = 0.049072
Comp name = nvinfer0 in_system_timestamp = 1762845268423.340088 out_system_timestamp = 1762845268656.212891               component latency= 232.872803
Source id = 3 Frame_num = 22 Frame latency = 267.996826 (ms)
Source id = 1 Frame_num = 16 Frame latency = 274.038818 (ms)
Source id = 2 Frame_num = 16 Frame latency = 268.359863 (ms)
Source id = 0 Frame_num = 16 Frame latency = 258.784912 (ms)
GPU Preproc:  1.27 ms
GPU Infer:    36.32 ms
GPU→CPU copy: 0.00 ms

************BATCH-NUM = 24**************
Comp name = nvv4l2decoder2 in_system_timestamp = 1762845268418.636963 out_system_timestamp = 1762845268419.510986               component latency= 0.874023
Comp name = nvstreammux-stream-muxer source_id = 3 pad_index = 3 frame_num = 23               in_system_timestamp = 1762845268419.617920 out_system_timestamp = 1762845268433.851074               component_latency = 14.233154
Comp name = nvv4l2decoder1 in_system_timestamp = 1762845268418.197998 out_system_timestamp = 1762845268430.413086               component latency= 12.215088
Comp name = nvstreammux-stream-muxer source_id = 2 pad_index = 2 frame_num = 17               in_system_timestamp = 1762845268430.562988 out_system_timestamp = 1762845268433.851074               component_latency = 3.288086
Comp name = nvv4l2decoder0 in_system_timestamp = 1762845268382.289062 out_system_timestamp = 1762845268432.050049               component latency= 49.760986
Comp name = nvstreammux-stream-muxer source_id = 1 pad_index = 1 frame_num = 17               in_system_timestamp = 1762845268432.132080 out_system_timestamp = 1762845268433.851074               component_latency = 1.718994
Comp name = nvv4l2decoder3 in_system_timestamp = 1762845268397.542969 out_system_timestamp = 1762845268433.702881               component latency= 36.159912
Comp name = nvstreammux-stream-muxer source_id = 0 pad_index = 0 frame_num = 17               in_system_timestamp = 1762845268433.802979 out_system_timestamp = 1762845268433.851074               component_latency = 0.048096
Comp name = nvinfer0 in_system_timestamp = 1762845268468.127930 out_system_timestamp = 1762845268699.215088               component latency= 231.087158
Source id = 3 Frame_num = 23 Frame latency = 280.667969 (ms)
Source id = 2 Frame_num = 17 Frame latency = 281.106934 (ms)
Source id = 1 Frame_num = 17 Frame latency = 317.015869 (ms)
Source id = 0 Frame_num = 17 Frame latency = 301.761963 (ms)
GPU Preproc:  1.28 ms
GPU Infer:    35.78 ms
GPU→CPU copy: 0.01 ms

All the elements after the element you mentioned.

Have you set up max power according to the Performance — DeepStream documentation?

@Fiona.Chen
I have already applied all the necessary settings, and now I am optimizing at the model level.

Can you provide the complete pipeline for this log? There is no “nvvideoconvert” entry, so it seems different from the pipeline in the original post.

my_cpp.txt (11.5 KB)

The above is a txt file containing my C++ code. I used the YOLOv5 model for inference.

Regarding the issue above, the time between streammux and nvinfer may be due to instability in one of my RTSP streams. My main focus now is the time consumed inside gst-nvinfer. My current assumption is: latency of the nvinfer plugin = preprocess time + TensorRT infer time + postprocess time.

When I use the performance testing tool from the FAQ, the output is:

************BATCH-NUM = 51**************
Comp name = nvv4l2decoder0 in_system_timestamp = 1763017653353.937988 out_system_timestamp = 1763017653356.047119               component latency= 2.109131
Comp name = nvstreammux-stream-muxer source_id = 2 pad_index = 2 frame_num = 53               in_system_timestamp = 1763017653356.104004 out_system_timestamp = 1763017653361.210938               component_latency = 5.106934
Comp name = nvv4l2decoder2 in_system_timestamp = 1763017653354.068115 out_system_timestamp = 1763017653357.746094               component latency= 3.677979
Comp name = nvstreammux-stream-muxer source_id = 1 pad_index = 1 frame_num = 53               in_system_timestamp = 1763017653357.802979 out_system_timestamp = 1763017653361.210938               component_latency = 3.407959
Comp name = nvv4l2decoder1 in_system_timestamp = 1763017653357.212891 out_system_timestamp = 1763017653359.407959               component latency= 2.195068
Comp name = nvstreammux-stream-muxer source_id = 0 pad_index = 0 frame_num = 53               in_system_timestamp = 1763017653359.449951 out_system_timestamp = 1763017653361.210938               component_latency = 1.760986
Comp name = nvv4l2decoder3 in_system_timestamp = 1763017653356.145020 out_system_timestamp = 1763017653361.105957               component latency= 4.960938
Comp name = nvstreammux-stream-muxer source_id = 3 pad_index = 3 frame_num = 53               in_system_timestamp = 1763017653361.170898 out_system_timestamp = 1763017653361.210938               component_latency = 0.040039
Comp name = nvinfer0 in_system_timestamp = 1763017653361.271973 out_system_timestamp = 1763017653403.332031               component latency= 42.060059
Source id = 2 Frame_num = 53 Frame latency = 49.482910 (ms)
Source id = 1 Frame_num = 53 Frame latency = 49.352783 (ms)
Source id = 0 Frame_num = 53 Frame latency = 46.208008 (ms)
Source id = 3 Frame_num = 53 Frame latency = 47.275879 (ms)
************BATCH-NUM = 52**************
Comp name = nvv4l2decoder0 in_system_timestamp = 1763017653392.892090 out_system_timestamp = 1763017653395.000977               component latency= 2.108887
Comp name = nvstreammux-stream-muxer source_id = 2 pad_index = 2 frame_num = 54               in_system_timestamp = 1763017653395.058105 out_system_timestamp = 1763017653400.094971               component_latency = 5.036865
Comp name = nvv4l2decoder2 in_system_timestamp = 1763017653393.155029 out_system_timestamp = 1763017653396.683105               component latency= 3.528076
Comp name = nvstreammux-stream-muxer source_id = 1 pad_index = 1 frame_num = 54               in_system_timestamp = 1763017653396.750000 out_system_timestamp = 1763017653400.094971               component_latency = 3.344971
Comp name = nvv4l2decoder3 in_system_timestamp = 1763017653395.847900 out_system_timestamp = 1763017653398.332031               component latency= 2.484131
Comp name = nvstreammux-stream-muxer source_id = 3 pad_index = 3 frame_num = 54               in_system_timestamp = 1763017653398.445068 out_system_timestamp = 1763017653400.094971               component_latency = 1.649902
Comp name = nvv4l2decoder1 in_system_timestamp = 1763017653397.125000 out_system_timestamp = 1763017653399.971924               component latency= 2.846924
Comp name = nvstreammux-stream-muxer source_id = 0 pad_index = 0 frame_num = 54               in_system_timestamp = 1763017653400.039062 out_system_timestamp = 1763017653400.094971               component_latency = 0.055908
Comp name = nvinfer0 in_system_timestamp = 1763017653400.162109 out_system_timestamp = 1763017653442.733887               component latency= 42.571777
Source id = 2 Frame_num = 54 Frame latency = 49.934814 (ms)
Source id = 1 Frame_num = 54 Frame latency = 49.671875 (ms)
Source id = 3 Frame_num = 54 Frame latency = 46.979004 (ms)
Source id = 0 Frame_num = 54 Frame latency = 45.701904 (ms)
Comp name = nvv4l2decoder2 in_system_timestamp = 1763017653423.589111 out_system_timestamp = 1763017653425.538086               component latency= 1.948975
Comp name = nvstreammux-stream-muxer source_id = 1 pad_index = 1 frame_num = 55               in_system_timestamp = 1763017653425.597900 out_system_timestamp = 1763017653430.566895               component_latency = 4.968994
Comp name = nvv4l2decoder0 in_system_timestamp = 1763017653423.541016 out_system_timestamp = 1763017653427.193115               component latency= 3.652100
Comp name = nvstreammux-stream-muxer source_id = 2 pad_index = 2 frame_num = 55               in_system_timestamp = 1763017653427.239014 out_system_timestamp = 1763017653430.566895               component_latency = 3.327881
Comp name = nvv4l2decoder3 in_system_timestamp = 1763017653426.011963 out_system_timestamp = 1763017653428.868896               component latency= 2.856934
Comp name = nvstreammux-stream-muxer source_id = 3 pad_index = 3 frame_num = 55               in_system_timestamp = 1763017653428.939941 out_system_timestamp = 1763017653430.566895               component_latency = 1.626953
Comp name = nvv4l2decoder1 in_system_timestamp = 1763017653427.326904 out_system_timestamp = 1763017653430.489014               component latency= 3.162109
Comp name = nvstreammux-stream-muxer source_id = 0 pad_index = 0 frame_num = 55               in_system_timestamp = 1763017653430.528076 out_system_timestamp = 1763017653430.566895               component_latency = 0.038818
Comp name = nvinfer0 in_system_timestamp = 1763017653430.627930 out_system_timestamp = 1763017653472.799072               component latency= 42.171143
Source id = 1 Frame_num = 55 Frame latency = 49.272949 (ms)
Source id = 2 Frame_num = 55 Frame latency = 49.321045 (ms)
Source id = 3 Frame_num = 55 Frame latency = 46.850098 (ms)
Source id = 0 Frame_num = 55 Frame latency = 45.535156 (ms)

The nvinfer plugin latency for these batches is 42.06 ms, 42.57 ms, and 42.17 ms, respectively.

Then I went into nvdsinfer and modified queueInputBatch and dequeueOutputBatch as follows:

NvDsInferStatus NvDsInferContextImpl::queueInputBatch(NvDsInferContextBatchInput &batchInput)
    {
        auto func_start = std::chrono::high_resolution_clock::now();

        cudaEvent_t preprocStart, preprocEnd, inferStart, inferEnd, copyStart, copyEnd;
        cudaEventCreate(&preprocStart);
        cudaEventCreate(&preprocEnd);
        cudaEventCreate(&inferStart);
        cudaEventCreate(&inferEnd);
        cudaEventCreate(&copyStart);
        cudaEventCreate(&copyEnd);

        assert(m_Initialized);
        uint32_t batchSize = batchInput.numInputFrames;

        /* Check that current batch size does not exceed max batch size. */
        if (batchSize > m_MaxBatchSize)
        {
            printError("Not inferring on batch since it's size(%d) exceeds max batch size(%d)", batchSize, m_MaxBatchSize);
            return NVDSINFER_INVALID_PARAMS;
        }

        /* Set the cuda device to be used. */
        RETURN_CUDA_ERR(cudaSetDevice(m_GpuID), "queue buffer failed to set cuda device(%s)", m_GpuID);

        std::shared_ptr<CudaEvent> preprocWaitEvent = m_InputConsumedEvent;

        assert(m_Preprocessor && m_InputConsumedEvent);

        cudaEventRecord(preprocStart, *m_InferStream);
        RETURN_NVINFER_ERROR(m_Preprocessor->transform(batchInput, m_BindingBuffers[INPUT_LAYER_INDEX], *m_InferStream, preprocWaitEvent.get()), "Preproc trans input data failed.");
        cudaEventRecord(preprocEnd, *m_InferStream);

        auto recyleFunc = [this](NvDsInferBatch *batch)
        {
            if (batch)
                m_FreeBatchQueue.push(batch);
        };
        std::unique_ptr<NvDsInferBatch, decltype(recyleFunc)> safeRecyleBatch(m_FreeBatchQueue.pop(), recyleFunc);
        assert(safeRecyleBatch);
        safeRecyleBatch->m_BatchSize = batchSize;

        /* Fill the array of binding buffers for the current batch. */
        std::vector<void *> &bindings = safeRecyleBatch->m_DeviceBuffers;
        auto backendBuffer = std::make_shared<BackendBatchBuffer>(bindings, m_AllLayerInfo, batchSize);
        assert(m_BackendContext && backendBuffer);
        assert(m_InferStream && m_InputConsumedEvent && m_InferCompleteEvent);

        cudaEventRecord(inferStart, *m_InferStream);
        RETURN_NVINFER_ERROR(m_BackendContext->enqueueBuffer(backendBuffer, *m_InferStream, m_InputConsumedEvent.get()), "Infer context enqueue buffer failed");
        cudaEventRecord(inferEnd, *m_InferStream);

        /* Record event on m_InferStream to indicate completion of inference on the
         * current batch. */
        RETURN_CUDA_ERR(cudaEventRecord(*m_InferCompleteEvent, *m_InferStream), "Failed to record cuda infer-complete-event ");

        assert(m_PostprocessStream && m_InferCompleteEvent);
        /* Make future jobs on the postprocessing stream wait on the infer
         * completion event. */
        RETURN_CUDA_ERR(cudaStreamWaitEvent(*m_PostprocessStream, *m_InferCompleteEvent, 0), "postprocessing cuda waiting event failed ");

        cudaEventRecord(copyStart, *m_InferStream);
        RETURN_NVINFER_ERROR(m_Postprocessor->copyBuffersToHostMemory(*safeRecyleBatch, *m_PostprocessStream), "post cuda process failed.");
        cudaEventRecord(copyEnd, *m_InferStream);
        cudaEventSynchronize(copyEnd);

        float preprocTime, inferTime, copyTime;
        cudaEventElapsedTime(&preprocTime, preprocStart, preprocEnd);
        cudaEventElapsedTime(&inferTime, inferStart, inferEnd);
        cudaEventElapsedTime(&copyTime, copyStart, copyEnd);

        printf("GPU Preproc:  %.2f ms\n", preprocTime);
        printf("GPU Infer:    %.2f ms\n", inferTime);
        printf("GPU→CPU copy: %.2f ms\n", copyTime);

        m_ProcessBatchQueue.push(safeRecyleBatch.release());

        auto func_end = std::chrono::high_resolution_clock::now();
        auto func_duration = std::chrono::duration_cast<std::chrono::microseconds>(func_end - func_start).count() / 1000.0;

        printf("[queueInputBatch] Total function time: %.2f ms (batch size: %d)\n", func_duration, batchSize);
        return NVDSINFER_SUCCESS;
    }
NvDsInferStatus NvDsInferContextImpl::dequeueOutputBatch(NvDsInferContextBatchOutput &batchOutput)
    {
        auto func_start = std::chrono::high_resolution_clock::now();
        assert(m_Initialized);
        auto recyleFunc = [this](NvDsInferBatch *batch)
        {
            if (batch)
                m_FreeBatchQueue.push(batch);
        };
        std::unique_ptr<NvDsInferBatch, decltype(recyleFunc)> recyleBatch(m_ProcessBatchQueue.pop(), recyleFunc);
        assert(recyleBatch);

        /* Set the cuda device */
        RETURN_CUDA_ERR(cudaSetDevice(m_GpuID), "dequeue buffer failed to set cuda device(%s)", m_GpuID);

        /* Wait for the copy to the current set of host buffers to complete. */
        RETURN_CUDA_ERR(cudaEventSynchronize(*recyleBatch->m_OutputCopyDoneEvent), "Failed to synchronize on cuda copy-coplete-event");

        assert(m_Postprocessor);
        /* Fill the host buffers information in the output. */
        RETURN_NVINFER_ERROR(m_Postprocessor->postProcessHost(*recyleBatch, batchOutput), "postprocessing host buffers failed.");

        /* Hold batch private data */
        batchOutput.priv = (void *)recyleBatch.release();

        auto func_end = std::chrono::high_resolution_clock::now();
        auto func_duration = std::chrono::duration_cast<std::chrono::microseconds>(func_end - func_start).count() / 1000.0;

        printf("[dequeueOutputBatch] Total function time: %.2f ms \n", func_duration);
        return NVDSINFER_SUCCESS;
    }

The corresponding output is:

************BATCH-NUM = 51**************
GPU Preproc:  1.28 ms
GPU Infer:    14.47 ms
GPU→CPU copy: 0.00 ms
[queueInputBatch] Total function time: 16.33 ms (batch size: 4)
[dequeueOutputBatch] Total function time: 6.43 ms
************BATCH-NUM = 52**************
GPU Preproc:  1.24 ms
GPU Infer:    14.14 ms
GPU→CPU copy: 0.00 ms
[queueInputBatch] Total function time: 15.93 ms (batch size: 4)
[dequeueOutputBatch] Total function time: 6.38 ms
************BATCH-NUM = 53**************
GPU Preproc:  1.24 ms
GPU Infer:    14.06 ms
GPU→CPU copy: 0.00 ms
[queueInputBatch] Total function time: 15.84 ms (batch size: 4)
[dequeueOutputBatch] Total function time: 6.12 ms

The time required for preprocessing, inference, and post-processing is much shorter than the latency reported for the nvinfer plugin. Also, my nvinfer is linked to a fakesink, so there should be no blockage downstream.
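
Just to rule out the sink side as well (a one-line sketch; “sink” is a placeholder for however the fakesink is created): a sink with sync enabled holds each buffer until its presentation time, and that back-pressure would show up as latency in the upstream elements.

#include <gst/gst.h>

/* 'sink' is assumed to be the fakesink element of my pipeline (placeholder name).
 * Disabling sync makes the sink consume buffers as fast as they arrive instead
 * of waiting on the pipeline clock. */
static void disable_sink_sync(GstElement *sink)
{
    g_object_set(G_OBJECT(sink), "sync", FALSE, NULL);
}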

I don’t know where the rest of the time is spent, and I haven’t been able to find it.
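
To put a number on “the rest”: assuming the two batch-51 logs above come from comparable runs, the instrumented functions only account for part of the nvinfer plugin latency (a quick calculation with the values copied from those logs):

#include <cstdio>

int main()
{
    /* Batch 51: nvinfer0 component latency from the FAQ latency log, plus the
     * queueInputBatch / dequeueOutputBatch times from my instrumentation. */
    const double plugin_latency_ms = 42.06;
    const double queue_input_ms    = 16.33;
    const double dequeue_output_ms = 6.43;

    const double unaccounted_ms = plugin_latency_ms - (queue_input_ms + dequeue_output_ms);
    std::printf("unaccounted inside gst-nvinfer: %.2f ms\n", unaccounted_ms); /* ~19.30 ms */
    return 0;
}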

Your preprocess time calculation is not complete. Please get familiar with gst-nvinfer; the gst-nvinfer diagram may help you understand the code better.

The default gst-nvinfer preprocessing consists of two parts: the first is video frame scaling and format conversion, and the second is normalization and tensor generation. “m_Preprocessor->transform()” is only the second part.

Even after dequeueOutputBatch returns, the postprocessing has only just started; you are ignoring all the postprocessing that happens after dequeueOutputBatch in gst_nvinfer_output_loop().

Only after that postprocessing finishes is the batch fully processed, and the GstBuffer can be delivered downstream through the src pad.
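
To attribute those missing parts, a generic scoped timer could be dropped around the scaling/format-conversion step and around the postprocessing that follows dequeueOutputBatch in gst_nvinfer_output_loop(). Below is a minimal sketch; it is illustrative only, not gst-nvinfer code, and it measures CPU wall time, so any work that runs asynchronously on a CUDA stream has to be synchronized before the timer stops.

#include <chrono>
#include <cstdio>

/* Prints the wall-clock time spent inside the enclosing scope. */
struct ScopedTimer
{
    const char *label;
    std::chrono::steady_clock::time_point start;

    explicit ScopedTimer(const char *l)
        : label(l), start(std::chrono::steady_clock::now()) {}

    ~ScopedTimer()
    {
        const auto end = std::chrono::steady_clock::now();
        const double ms = std::chrono::duration<double, std::milli>(end - start).count();
        std::printf("%s: %.2f ms\n", label, ms);
    }
};

/* Usage sketch:
 *   {
 *       ScopedTimer t("scale + format conversion");   // first part of the preprocessing
 *       // ... the NvBufSurfTransform-based scaling / conversion ...
 *   }
 *   {
 *       ScopedTimer t("output-loop postprocessing");  // work after dequeueOutputBatch
 *       // ... the postprocessing in gst_nvinfer_output_loop() ...
 *   }
 */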

gst-nvinfer is fully open source; the code is the best answer.

@Fiona.Chen
Thank you for your reply. I am currently reading the relevant code to see if I can find the problem.

Why do you think it is a problem? What kind of problem?

@Fiona.Chen
This is not a problem; it is just something I am unsure about. I compared the performance of a YOLOv5-n model and a YOLOv5-tiny model. Both were quantized with int8+fp16, and the interval property of the pgie was set to 0. The result is that the inference performance of the two models is similar, even though there is a significant difference in parameter count and model complexity between them.

Subsequently, I ran tests with 8 RTSP streams. With interval set to 0, there was a significant delay in the pipeline: the latency of the nvinfer component reached 300-400 ms. With interval set to 1, the delay dropped to about 120 ms. I want to find out where all that time goes, or whether it is actually being wasted at all, since the delay is so large during normal operation.
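
For reference, a minimal sketch of how I switch the interval in these tests (“pgie” is a placeholder for the nvinfer element; the same value can also be set with interval=1 in the [property] group of the nvinfer config file):

#include <gst/gst.h>

/* 'pgie' is assumed to be the gst-nvinfer element (placeholder name).
 * interval = N means N consecutive batches are skipped for inference. */
static void set_infer_interval(GstElement *pgie, guint interval)
{
    g_object_set(G_OBJECT(pgie), "interval", interval, NULL);
}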

I also want to gain a deeper understanding of the overall framework, so I am currently reading NVIDIA’s source code and learning from the nvinfer code diagram in the FAQ.

That’s great!