my_cpp.txt (11.5 KB)
The file above is a .txt dump of my C++ code. I used the YOLOv5 model for inference.
Regarding the issue above, the gap between streammux and nvinfer may be caused by instability on one of my RTSP streams. My main focus right now is investigating the time consumed by gst-nvinfer. My working assumption is: nvinfer plugin latency = preprocess time + TensorRT inference time + postprocess time.
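For context, the FAQ performance test I used below requires exporting NVDS_ENABLE_LATENCY_MEASUREMENT=1 and NVDS_ENABLE_COMPONENT_LATENCY_MEASUREMENT=1 and adding a buffer probe that calls nvds_measure_buffer_latency(). A minimal sketch of such a probe (illustrative only, assuming the API declared in nvds_latency_meta.h; MAX_SOURCES is just a placeholder for the number of streams):

#include <gst/gst.h>
#include "nvds_latency_meta.h"

#define MAX_SOURCES 4  /* placeholder: streams per batch */

static GstPadProbeReturn
latency_measurement_buf_prob(GstPad *pad, GstPadProbeInfo *info, gpointer u_data)
{
    static guint batch_num = 0;
    if (nvds_get_enable_latency_measurement())
    {
        GstBuffer *buf = (GstBuffer *) info->data;
        NvDsFrameLatencyInfo latency_info[MAX_SOURCES];
        g_print("\n************BATCH-NUM = %d**************\n", batch_num++);
        /* Fills latency_info with per-source frame latency for this batch. */
        guint num = nvds_measure_buffer_latency(buf, latency_info);
        for (guint i = 0; i < num; i++)
            g_print("Source id = %d Frame_num = %d Frame latency = %lf (ms)\n",
                latency_info[i].source_id, latency_info[i].frame_num,
                latency_info[i].latency);
    }
    return GST_PAD_PROBE_OK;
}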
When I use performance testing in the FAQ, output:
************BATCH-NUM = 51**************
Comp name = nvv4l2decoder0 in_system_timestamp = 1763017653353.937988 out_system_timestamp = 1763017653356.047119 component latency= 2.109131
Comp name = nvstreammux-stream-muxer source_id = 2 pad_index = 2 frame_num = 53 in_system_timestamp = 1763017653356.104004 out_system_timestamp = 1763017653361.210938 component_latency = 5.106934
Comp name = nvv4l2decoder2 in_system_timestamp = 1763017653354.068115 out_system_timestamp = 1763017653357.746094 component latency= 3.677979
Comp name = nvstreammux-stream-muxer source_id = 1 pad_index = 1 frame_num = 53 in_system_timestamp = 1763017653357.802979 out_system_timestamp = 1763017653361.210938 component_latency = 3.407959
Comp name = nvv4l2decoder1 in_system_timestamp = 1763017653357.212891 out_system_timestamp = 1763017653359.407959 component latency= 2.195068
Comp name = nvstreammux-stream-muxer source_id = 0 pad_index = 0 frame_num = 53 in_system_timestamp = 1763017653359.449951 out_system_timestamp = 1763017653361.210938 component_latency = 1.760986
Comp name = nvv4l2decoder3 in_system_timestamp = 1763017653356.145020 out_system_timestamp = 1763017653361.105957 component latency= 4.960938
Comp name = nvstreammux-stream-muxer source_id = 3 pad_index = 3 frame_num = 53 in_system_timestamp = 1763017653361.170898 out_system_timestamp = 1763017653361.210938 component_latency = 0.040039
Comp name = nvinfer0 in_system_timestamp = 1763017653361.271973 out_system_timestamp = 1763017653403.332031 component latency= 42.060059
Source id = 2 Frame_num = 53 Frame latency = 49.482910 (ms)
Source id = 1 Frame_num = 53 Frame latency = 49.352783 (ms)
Source id = 0 Frame_num = 53 Frame latency = 46.208008 (ms)
Source id = 3 Frame_num = 53 Frame latency = 47.275879 (ms)
************BATCH-NUM = 52**************
Comp name = nvv4l2decoder0 in_system_timestamp = 1763017653392.892090 out_system_timestamp = 1763017653395.000977 component latency= 2.108887
Comp name = nvstreammux-stream-muxer source_id = 2 pad_index = 2 frame_num = 54 in_system_timestamp = 1763017653395.058105 out_system_timestamp = 1763017653400.094971 component_latency = 5.036865
Comp name = nvv4l2decoder2 in_system_timestamp = 1763017653393.155029 out_system_timestamp = 1763017653396.683105 component latency= 3.528076
Comp name = nvstreammux-stream-muxer source_id = 1 pad_index = 1 frame_num = 54 in_system_timestamp = 1763017653396.750000 out_system_timestamp = 1763017653400.094971 component_latency = 3.344971
Comp name = nvv4l2decoder3 in_system_timestamp = 1763017653395.847900 out_system_timestamp = 1763017653398.332031 component latency= 2.484131
Comp name = nvstreammux-stream-muxer source_id = 3 pad_index = 3 frame_num = 54 in_system_timestamp = 1763017653398.445068 out_system_timestamp = 1763017653400.094971 component_latency = 1.649902
Comp name = nvv4l2decoder1 in_system_timestamp = 1763017653397.125000 out_system_timestamp = 1763017653399.971924 component latency= 2.846924
Comp name = nvstreammux-stream-muxer source_id = 0 pad_index = 0 frame_num = 54 in_system_timestamp = 1763017653400.039062 out_system_timestamp = 1763017653400.094971 component_latency = 0.055908
Comp name = nvinfer0 in_system_timestamp = 1763017653400.162109 out_system_timestamp = 1763017653442.733887 component latency= 42.571777
Source id = 2 Frame_num = 54 Frame latency = 49.934814 (ms)
Source id = 1 Frame_num = 54 Frame latency = 49.671875 (ms)
Source id = 3 Frame_num = 54 Frame latency = 46.979004 (ms)
Source id = 0 Frame_num = 54 Frame latency = 45.701904 (ms)
Comp name = nvv4l2decoder2 in_system_timestamp = 1763017653423.589111 out_system_timestamp = 1763017653425.538086 component latency= 1.948975
Comp name = nvstreammux-stream-muxer source_id = 1 pad_index = 1 frame_num = 55 in_system_timestamp = 1763017653425.597900 out_system_timestamp = 1763017653430.566895 component_latency = 4.968994
Comp name = nvv4l2decoder0 in_system_timestamp = 1763017653423.541016 out_system_timestamp = 1763017653427.193115 component latency= 3.652100
Comp name = nvstreammux-stream-muxer source_id = 2 pad_index = 2 frame_num = 55 in_system_timestamp = 1763017653427.239014 out_system_timestamp = 1763017653430.566895 component_latency = 3.327881
Comp name = nvv4l2decoder3 in_system_timestamp = 1763017653426.011963 out_system_timestamp = 1763017653428.868896 component latency= 2.856934
Comp name = nvstreammux-stream-muxer source_id = 3 pad_index = 3 frame_num = 55 in_system_timestamp = 1763017653428.939941 out_system_timestamp = 1763017653430.566895 component_latency = 1.626953
Comp name = nvv4l2decoder1 in_system_timestamp = 1763017653427.326904 out_system_timestamp = 1763017653430.489014 component latency= 3.162109
Comp name = nvstreammux-stream-muxer source_id = 0 pad_index = 0 frame_num = 55 in_system_timestamp = 1763017653430.528076 out_system_timestamp = 1763017653430.566895 component_latency = 0.038818
Comp name = nvinfer0 in_system_timestamp = 1763017653430.627930 out_system_timestamp = 1763017653472.799072 component latency= 42.171143
Source id = 1 Frame_num = 55 Frame latency = 49.272949 (ms)
Source id = 2 Frame_num = 55 Frame latency = 49.321045 (ms)
Source id = 3 Frame_num = 55 Frame latency = 46.850098 (ms)
Source id = 0 Frame_num = 55 Frame latency = 45.535156 (ms)
The nvinfer plugin latency for these three batches is 42.06 ms, 42.57 ms, and 42.17 ms, respectively.
Then I went into nvdsinfer and added timing instrumentation to the following two functions:
NvDsInferStatus NvDsInferContextImpl::queueInputBatch(NvDsInferContextBatchInput &batchInput)
{
    auto func_start = std::chrono::high_resolution_clock::now();
    cudaEvent_t preprocStart, preprocEnd, inferStart, inferEnd, copyStart, copyEnd;
    cudaEventCreate(&preprocStart);
    cudaEventCreate(&preprocEnd);
    cudaEventCreate(&inferStart);
    cudaEventCreate(&inferEnd);
    cudaEventCreate(&copyStart);
    cudaEventCreate(&copyEnd);
    assert(m_Initialized);
    uint32_t batchSize = batchInput.numInputFrames;
    /* Check that current batch size does not exceed max batch size. */
    if (batchSize > m_MaxBatchSize)
    {
        printError("Not inferring on batch since it's size(%d) exceeds max batch size(%d)", batchSize, m_MaxBatchSize);
        return NVDSINFER_INVALID_PARAMS;
    }
    /* Set the cuda device to be used. */
    RETURN_CUDA_ERR(cudaSetDevice(m_GpuID), "queue buffer failed to set cuda device(%s)", m_GpuID);
    std::shared_ptr<CudaEvent> preprocWaitEvent = m_InputConsumedEvent;
    assert(m_Preprocessor && m_InputConsumedEvent);
    cudaEventRecord(preprocStart, *m_InferStream);
    RETURN_NVINFER_ERROR(m_Preprocessor->transform(batchInput, m_BindingBuffers[INPUT_LAYER_INDEX], *m_InferStream, preprocWaitEvent.get()), "Preproc trans input data failed.");
    cudaEventRecord(preprocEnd, *m_InferStream);
    auto recyleFunc = [this](NvDsInferBatch *batch)
    {
        if (batch)
            m_FreeBatchQueue.push(batch);
    };
    std::unique_ptr<NvDsInferBatch, decltype(recyleFunc)> safeRecyleBatch(m_FreeBatchQueue.pop(), recyleFunc);
    assert(safeRecyleBatch);
    safeRecyleBatch->m_BatchSize = batchSize;
    /* Fill the array of binding buffers for the current batch. */
    std::vector<void *> &bindings = safeRecyleBatch->m_DeviceBuffers;
    auto backendBuffer = std::make_shared<BackendBatchBuffer>(bindings, m_AllLayerInfo, batchSize);
    assert(m_BackendContext && backendBuffer);
    assert(m_InferStream && m_InputConsumedEvent && m_InferCompleteEvent);
    cudaEventRecord(inferStart, *m_InferStream);
    RETURN_NVINFER_ERROR(m_BackendContext->enqueueBuffer(backendBuffer, *m_InferStream, m_InputConsumedEvent.get()), "Infer context enqueue buffer failed");
    cudaEventRecord(inferEnd, *m_InferStream);
    /* Record event on m_InferStream to indicate completion of inference on the
     * current batch. */
    RETURN_CUDA_ERR(cudaEventRecord(*m_InferCompleteEvent, *m_InferStream), "Failed to record cuda infer-complete-event ");
    assert(m_PostprocessStream && m_InferCompleteEvent);
    /* Make future jobs on the postprocessing stream wait on the infer
     * completion event. */
    RETURN_CUDA_ERR(cudaStreamWaitEvent(*m_PostprocessStream, *m_InferCompleteEvent, 0), "postprocessing cuda waiting event failed ");
    cudaEventRecord(copyStart, *m_InferStream);
    RETURN_NVINFER_ERROR(m_Postprocessor->copyBuffersToHostMemory(*safeRecyleBatch, *m_PostprocessStream), "post cuda process failed.");
    cudaEventRecord(copyEnd, *m_InferStream);
    cudaEventSynchronize(copyEnd);
    float preprocTime, inferTime, copyTime;
    cudaEventElapsedTime(&preprocTime, preprocStart, preprocEnd);
    cudaEventElapsedTime(&inferTime, inferStart, inferEnd);
    cudaEventElapsedTime(&copyTime, copyStart, copyEnd);
    printf("GPU Preproc: %.2f ms\n", preprocTime);
    printf("GPU Infer: %.2f ms\n", inferTime);
    printf("GPU→CPU copy: %.2f ms\n", copyTime);
    m_ProcessBatchQueue.push(safeRecyleBatch.release());
    auto func_end = std::chrono::high_resolution_clock::now();
    auto func_duration = std::chrono::duration_cast<std::chrono::microseconds>(func_end - func_start).count() / 1000.0;
    printf("[queueInputBatch] Total function time: %.2f ms (batch size: %d)\n", func_duration, batchSize);
    return NVDSINFER_SUCCESS;
}
NvDsInferStatus NvDsInferContextImpl::dequeueOutputBatch(NvDsInferContextBatchOutput &batchOutput)
{
    auto func_start = std::chrono::high_resolution_clock::now();
    assert(m_Initialized);
    auto recyleFunc = [this](NvDsInferBatch *batch)
    {
        if (batch)
            m_FreeBatchQueue.push(batch);
    };
    std::unique_ptr<NvDsInferBatch, decltype(recyleFunc)> recyleBatch(m_ProcessBatchQueue.pop(), recyleFunc);
    assert(recyleBatch);
    /* Set the cuda device */
    RETURN_CUDA_ERR(cudaSetDevice(m_GpuID), "dequeue buffer failed to set cuda device(%s)", m_GpuID);
    /* Wait for the copy to the current set of host buffers to complete. */
    RETURN_CUDA_ERR(cudaEventSynchronize(*recyleBatch->m_OutputCopyDoneEvent), "Failed to synchronize on cuda copy-coplete-event");
    assert(m_Postprocessor);
    /* Fill the host buffers information in the output. */
    RETURN_NVINFER_ERROR(m_Postprocessor->postProcessHost(*recyleBatch, batchOutput), "postprocessing host buffers failed.");
    /* Hold batch private data */
    batchOutput.priv = (void *)recyleBatch.release();
    auto func_end = std::chrono::high_resolution_clock::now();
    auto func_duration = std::chrono::duration_cast<std::chrono::microseconds>(func_end - func_start).count() / 1000.0;
    printf("[dequeueOutputBatch] Total function time: %.2f ms \n", func_duration);
    return NVDSINFER_SUCCESS;
}
The corresponding output is:
************BATCH-NUM = 51**************
GPU Preproc: 1.28 ms
GPU Infer: 14.47 ms
GPU→CPU copy: 0.00 ms
[queueInputBatch] Total function time: 16.33 ms (batch size: 4)
[dequeueOutputBatch] Total function time: 6.43 ms
************BATCH-NUM = 52**************
GPU Preproc: 1.24 ms
GPU Infer: 14.14 ms
GPU→CPU copy: 0.00 ms
[queueInputBatch] Total function time: 15.93 ms (batch size: 4)
[dequeueOutputBatch] Total function time: 6.38 ms
************BATCH-NUM = 53**************
GPU Preproc: 1.24 ms
GPU Infer: 14.06 ms
GPU→CPU copy: 0.00 ms
[queueInputBatch] Total function time: 15.84 ms (batch size: 4)
[dequeueOutputBatch] Total function time: 6.12 ms
The time spent on preprocessing, inference, and the device-to-host copy is much shorter than what the nvinfer plugin reports: roughly 16 ms for queueInputBatch plus 6.4 ms for dequeueOutputBatch, about 22 ms per batch, versus the ~42 ms component latency, leaving around 20 ms unaccounted for. My nvinfer is linked directly to a fakesink, so there should be no blocking from downstream.
I don't know where the rest of the time is being spent, and I have not been able to track it down.
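One thing I am considering, to narrow this down further, is measuring how long each batch sits between queueInputBatch() and dequeueOutputBatch(), i.e. waiting in m_ProcessBatchQueue for the output side to pick it up. A rough sketch of what I have in mind, assuming batches are popped in the same FIFO order they are pushed (the helper names are placeholders I made up, not existing nvdsinfer code):

#include <chrono>
#include <cstdio>
#include <mutex>
#include <queue>

/* One wall-clock timestamp per batch pushed to m_ProcessBatchQueue. */
static std::mutex g_gapMutex;
static std::queue<std::chrono::high_resolution_clock::time_point> g_queueTimes;

/* Call right after m_ProcessBatchQueue.push(...) in queueInputBatch(). */
static void markBatchQueued()
{
    std::lock_guard<std::mutex> lock(g_gapMutex);
    g_queueTimes.push(std::chrono::high_resolution_clock::now());
}

/* Call right after m_ProcessBatchQueue.pop() in dequeueOutputBatch(). */
static void markBatchDequeued()
{
    std::lock_guard<std::mutex> lock(g_gapMutex);
    if (g_queueTimes.empty())
        return;
    auto queued = g_queueTimes.front();
    g_queueTimes.pop();
    double gapMs = std::chrono::duration_cast<std::chrono::microseconds>(
        std::chrono::high_resolution_clock::now() - queued).count() / 1000.0;
    printf("[queue->dequeue gap] %.2f ms\n", gapMs);
}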