Please provide complete information as applicable to your setup.
**• Hardware Platform (Jetson / GPU):** GPU
**• DeepStream Version:** DS 7.0
**• JetPack Version (valid for Jetson only):**
**• TensorRT Version:**
**• NVIDIA GPU Driver Version (valid for GPU only):** L4
**• Issue Type (questions, new requirements, bugs):** Questions
**• How to reproduce the issue? (This is for bugs. Include which sample app is used, the configuration file content, the command line used, and other details for reproducing.)**
**• Requirement details (This is for new requirements. Include the module name, i.e. which plugin or which sample application, and the function description):**

We have observed a throughput drop in our DeepStream pipeline, which appears to be caused by the nvinferserver plugin. Our setup runs Triton and DeepStream in separate containers. The DeepStream pipeline is as follows:
nvmultiurisrcbin → queue → nvinferserver → queue → appsink
From profiling, we found that processing a batch of 32 frames takes approximately 70 milliseconds. Within nvinferserver, the preprocessing stage takes over 30 milliseconds (including wait time). nvinferserver uses two preprocessors: CropSurfaceConverter and NetworkPreprocessor. We have noticed a delay of about 20 milliseconds between these two preprocessors.
Could you help us understand why there is such a significant delay between these two preprocessors in nvinferserver ? Any insights or suggestions to reduce this delay would be appreciated.
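For context, a minimal gst-launch-1.0 sketch of the pipeline described above might look like the following. The URI, config path, resolution, and property names are placeholders and assumptions, not taken from the actual setup:

```
gst-launch-1.0 nvmultiurisrcbin uri-list="file:///path/to/video.mp4" \
    max-batch-size=32 width=1920 height=1080 \
  ! queue \
  ! nvinferserver config-file-path=/path/to/nvinferserver_grpc_config.txt \
  ! queue \
  ! appsink sync=false
```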
As to this question: the CropSurfaceConverter and NetworkPreprocessor run in different threads and work in asynchronous mode. The delay between them can be affected by many factors, such as the OS thread scheduler and the performance of other threads that also need CUDA resources.
From your profiling graph, the NetworkPreprocessor for other batches happens before the NetworkPreprocessor of the batch you marked.
Can you explain the three models in your "ensemble" model?
In the profiling graph, there is a noticeable delay between the CropSurfaceConverter and the NetworkPreprocessor stages. Both pre-processors show minimal usage of CUDA resources. Profiling from both Triton and DeepStream indicates that the total preprocessing time is comparable to the inference time, which appears to be limiting throughput.
Additionally, the time taken by the CropSurfaceConverter fluctuates significantly from batch to batch (ranging from 3 ms to 15 ms), and this variability is also observed in the NetworkPreprocessor stage. While the NetworkPreprocessor's wait time for the CropSurfaceConverter is relatively short (as previously shared), there is a consistent delay of around 23 ms between the CropSurfaceConverter completing and the NetworkPreprocessor beginning. This gap suggests an opportunity for optimization to improve overall pipeline throughput.
The three models in the ensemble are:

Yolox Preprocess:
- Model: yolox_s_preprocess
- Backend: DALI
- Batch size: 32
- Input: "INPUT", UINT8, [3,384,640]
- Output: "yolox_image", FP16, [3,384,640] (normalizes the image using mean and std values)
- instance_group: 1 instance, GPU 0
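Based on the fields above, the Triton `config.pbtxt` for this preprocess model would look roughly like the following. Only the names, types, dims, batch size, and instance group are taken from the list above; everything else is an assumption:

```
name: "yolox_s_preprocess"
backend: "dali"
max_batch_size: 32
input [
  {
    name: "INPUT"
    data_type: TYPE_UINT8
    dims: [ 3, 384, 640 ]
  }
]
output [
  {
    name: "yolox_image"
    data_type: TYPE_FP16
    dims: [ 3, 384, 640 ]
  }
]
instance_group [
  { count: 1, kind: KIND_GPU, gpus: [ 0 ] }
]
```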
nvinferserver is just a Triton client. The CropSurfaceConverter and NetworkPreprocessor are both part of nvinferserver's preprocessing; after preprocessing, the data is sent to the Triton server for inferencing, and nvinferserver waits for the response from the Triton server to get the inference output. All of this is done asynchronously by sharing buffers between nvinferserver and the Triton server. That is why you set "enable_cuda_buffer_sharing: true" in your configuration file. But the buffer pool shared between nvinferserver and the Triton server is limited in size, which means the Triton server's inferencing latency (including gRPC latency) affects nvinferserver's preprocessing latency. When all the buffers in the shared pool are held by the Triton server, no buffer is available in nvinferserver, so nvinferserver has to wait until a buffer is returned to the pool.
From your Nsight data, at the beginning the NetworkPreprocessor happens right after the CropSurfaceConverter, because the NetworkPreprocessor can get an output buffer from the buffer pool.
But after two batches, the NetworkPreprocessor can only start after "InferComplete", because there are only 2 buffers in the buffer pool; the 3rd buffer only becomes available after the Triton server releases a buffer back to the pool.
So the delay is determined by the Triton server's inferencing speed. It is expected to see a delay between the CropSurfaceConverter and the NetworkPreprocessor, because the Triton server's inferencing latency is much longer than the preprocessing time; the preprocessing of the next batches runs in parallel with the Triton server's inferencing of the previous batches.
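The buffer-pool behavior described above can be sketched with a small simulation. This is not DeepStream code; the pool size and latencies are assumptions chosen to mirror the explanation (2 shared buffers, inference much slower than preprocessing). The preprocessor thread blocks on the pool exactly the way the NetworkPreprocessor blocks once the Triton side holds all buffers:

```python
import threading
import queue
import time

POOL_SIZE = 2          # assumed number of shared buffers
NUM_BATCHES = 6
INFER_S = 0.030        # assumed Triton inference latency per batch
PREPROC_S = 0.005      # assumed preprocessing latency per batch

# Pool of free output buffers shared between the client and the server.
pool = queue.Queue()
for i in range(POOL_SIZE):
    pool.put(i)

inflight = queue.Queue()
wait_times = []

def triton_worker():
    # Simulated Triton server: holds each buffer for INFER_S, then returns it.
    while True:
        buf = inflight.get()
        if buf is None:
            break
        time.sleep(INFER_S)
        pool.put(buf)          # "InferComplete": buffer goes back to the pool

t = threading.Thread(target=triton_worker)
t.start()

for batch in range(NUM_BATCHES):
    start = time.monotonic()
    buf = pool.get()           # blocks here when the pool is empty
    wait_times.append(time.monotonic() - start)
    time.sleep(PREPROC_S)      # do the preprocessing
    inflight.put(buf)          # hand the buffer to the server asynchronously

inflight.put(None)
t.join()

# The first POOL_SIZE batches get a buffer immediately; later batches wait
# roughly INFER_S - PREPROC_S for a buffer to be released by the server.
print([round(w * 1000) for w in wait_times])
```

The first two batches show near-zero wait, while every batch from the third onward waits on the order of the inference latency, which matches the pattern seen in the Nsight trace.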
The root cause is your ensemble pipeline’s performance is not good enough.
Thanks for your reply. Are you referring to output_buffer_pool_size in the inference config? We didn't observe any change when setting output_buffer_pool_size to 6 or 12. Running `perf_analyzer -m ensemble_yolox_s -i grpc --percentile 95 --async --streaming --shared-memory cuda --concurrency-range 1:4:1 -b 32 -v` gives 35% more throughput compared to the DeepStream pipeline.
Initially, inference was running in parallel for 5 batches, but later it was only running for 4 batches in parallel. Which configuration setting is responsible for controlling this behavior?
The root cause is that your ensemble pipeline's performance is not good enough. Changing output_buffer_pool_size will not help; I just wanted to explain why the delay between the CropSurfaceConverter and the NetworkPreprocessor changes from batch 3 onward. No matter what the output_buffer_pool_size value is, the delay will appear once all the buffers in the pool are occupied by the Triton server side.
What is your input to DeepStream pipeline? Local video file, live stream or others?
The input is a local video file. For the ensemble model, the throughput was higher when we measured it using perf_analyzer. We also observed higher throughput when the input file was resized. I think the preprocessing has some impact on throughput. How can we increase throughput?
Can you try setting the "buffer-pool-size" property of nvstreammux to 10 and the "num-extra-surfaces" property of nvv4l2decoder to 6 in your pipeline?
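For illustration, here is one way those two properties could be set in a pipeline that uses an explicit decoder and nvstreammux. This is a hypothetical single-source fragment (file path, codec elements, and resolution are placeholders), not the actual nvmultiurisrcbin-based pipeline from this thread:

```
gst-launch-1.0 \
  filesrc location=/path/to/video.mp4 ! qtdemux ! h264parse \
  ! nvv4l2decoder num-extra-surfaces=6 \
  ! m.sink_0 nvstreammux name=m batch-size=32 width=1920 height=1080 buffer-pool-size=10 \
  ! nvinferserver config-file-path=/path/to/config.txt \
  ! fakesink sync=false
```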
We are already testing throughput with "sync" set to False. Replacing appsink with fakesink resulted in only a minor increase: throughput improved by just 5 inferences per second for a pipeline with 32 sources (files).