DeepStream throughput drop due to nvinferserver plugin

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU): GPU
• DeepStream Version: DS 7.0
• JetPack Version (valid for Jetson only)
• TensorRT Version
• NVIDIA GPU Driver Version (valid for GPU only): L4
• Issue Type (questions, new requirements, bugs): Questions
• How to reproduce the issue? (This is for bugs. Include which sample app is used, the configuration file contents, the command line used, and other details for reproducing.)
• Requirement details (This is for a new requirement. Include the module name, i.e. which plugin or sample application, and the function description): We have observed a throughput drop in our DeepStream pipeline, which appears to be caused by the nvinferserver plugin. Our setup runs Triton and DeepStream in separate containers. The DeepStream pipeline is as follows:
nvmultiurisrcbin → queue → nvinferserver → queue → appsink

From profiling, we found that processing a batch of 32 frames takes approximately 70 milliseconds. Within nvinferserver, the preprocessing stage takes over 30 milliseconds (including wait time). The nvinferserver uses two preprocessors: CropSurfaceConverter and NetworkPreprocessor. We have noticed a delay of about 20 milliseconds between these two preprocessors.
Could you help us understand why there is such a significant delay between these two preprocessors in nvinferserver ? Any insights or suggestions to reduce this delay would be appreciated.
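As a rough budget check of the numbers above, the measured 70 ms per 32-frame batch and the 30 ms of preprocessing bound the possible gain; the full-overlap figure below is our hypothetical upper bound, not a measured value:

```python
# Back-of-the-envelope throughput budget from the figures quoted above.
batch_size = 32
batch_time_ms = 70.0    # measured time to process one batch of 32 frames
preprocess_ms = 30.0    # measured nvinferserver preprocessing, incl. wait time

fps_now = batch_size / (batch_time_ms / 1000.0)
print(f"current throughput: ~{fps_now:.0f} frames/s")   # ~457 frames/s

# If the ~30 ms of preprocessing overlapped completely with inference
# (hypothetical best case), the batch period would shrink toward ~40 ms.
fps_overlap = batch_size / ((batch_time_ms - preprocess_ms) / 1000.0)
print(f"full-overlap upper bound: ~{fps_overlap:.0f} frames/s")  # ~800 frames/s
```

So even in the ideal case, hiding the preprocessing delay entirely would not quite double the throughput; anything beyond that needs the per-batch inference time itself to shrink.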


profile.zip (84.9 MB)

Note: please rename the profile.zip to profile.7z

Triton model config:

name: "ensemble_yolox_s"
platform: "ensemble"
max_batch_size: 32
input [
  {
    name: "INPUT"
    # data_type: TYPE_FP32
    data_type: TYPE_UINT8
    dims: [ 3, 384, 640 ]
  }
]
output [
  {
    name: "YOLOX_BBOX"
    data_type: TYPE_FP32
    dims: [ 7 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "yolox_s_preprocess"
      model_version: 1
      input_map {
        key: "INPUT"
        value: "INPUT"
      }
      output_map {
        key: "yolox_image"
        value: "yolox_image"
      }
    },
    {
      model_name: "yolox_s"
      model_version: 1
      input_map {
        key: "data"
        value: "yolox_image"
      }
      output_map {
        key: "stride8"
        value: "stride8"
      }
      output_map {
        key: "stride16"
        value: "stride16"
      }
      output_map {
        key: "stride32"
        value: "stride32"
      }
    },
    {
      model_name: "yolox_s_postprocess"
      model_version: 1
      input_map {
        key: "stride8"
        value: "stride8"
      }
      input_map {
        key: "stride16"
        value: "stride16"
      }
      input_map {
        key: "stride32"
        value: "stride32"
      }
      output_map {
        key: "OUTPUT"
        value: "YOLOX_BBOX"
      }
    }
  ]
}
Infer config:

infer_config {
  unique_id: 21
  gpu_ids: 0
  max_batch_size: 32
  backend {
    triton {
      model_name: "ensemble_yolox_s"
      version: -1
      grpc {
        url: "localhost:8001"
        enable_cuda_buffer_sharing: true
      }
    }
  }
  preprocess {
    network_format: IMAGE_FORMAT_RGB
    tensor_order: TENSOR_ORDER_LINEAR
    maintain_aspect_ratio: 1
    normalize {
      scale_factor: 1.0
    }
  }
  postprocess {
    labelfile_path: "yolo_normal_labels.json"
    other {
    }
  }
  custom_lib {
    path: "libai_parsing_library.so"
  }
  extra {
    custom_process_funcion: "CreateInferServerCustomProcessYoloX"
  }
}
input_control {
  process_mode: PROCESS_MODE_FULL_FRAME
  interval: 0
}
output_control {
}



We also printed the NetworkPreprocessor wait time; the printed wait time is much less than the delay shown in the profiling.

Can you share the nvinferserver configuration file?

The nvinferserver configuration file has already been shared above.

As to this question: the CropSurfaceConverter and NetworkPreprocessor are in different threads and work in asynchronous mode. The delay between them may be impacted by many factors, such as the OS thread scheduler, the performance of other threads that also need CUDA resources, etc.

From your profiling graph, the NetworkPreprocessor for other batches happens before the NetworkPreprocessor of the batch you marked.

Can you explain the three models in your "ensemble" model?

In the profiling graph, there is a noticeable delay between the CropSurfaceConverter and the NetworkPreprocessor stages. Both pre-processors show minimal usage of CUDA resources. Profiling from both Triton and DeepStream indicates that the total preprocessing time is comparable to the inference time, which appears to be limiting throughput.

Additionally, the time taken by CropSurfaceConverter fluctuates significantly from batch to batch (ranging from 3 ms to 15 ms), and this variability is also observed in the NetworkPreprocessor stage. While the NetworkPreprocessor wait time for the CropSurfaceConverter is relatively short (as previously shared), there is a consistent delay of around 23 ms before the NetworkPreprocessor begins after the CropSurfaceConverter completes. This gap suggests an opportunity for optimization to improve overall pipeline throughput.

The three models in the ensemble are:

  1. YOLOX preprocess: Model: yolox_s_preprocess
    Backend: DALI
    Batch size: 32
    Input: "INPUT", UINT8, [3, 384, 640]
    Output: "yolox_image", FP16, [3, 384, 640] (normalizes the image using mean and std values)
    instance_group: 1 instance, GPU 0

  2. YOLOX model: Model: yolox_s
    Platform: tensorrt_plan
    Batch size: 32
    Input: "data", FP16, [3, 384, 640]
    Outputs: "stride8", "stride16", "stride32" (FP16, multi-scale feature maps for detection)
    instance_group: 1 instance, GPU 0

  3. YOLOX postprocess: Model: yolox_s_postprocess
    Backend: custom backend
    Batch size: 32
    Inputs: stride8, stride16, stride32 (FP16 feature maps)
    Output: OUTPUT (FP32, [7], final detections (bounding boxes))
    instance_group: 1 instance, GPU 0
    confidence: 0.4 (detections below this threshold are ignored)
    nms_threshold: 0.45 (NMS for overlap removal)
    num_classes: 80

Please use the "multi report view" in Nsight Systems to view both the Triton and DeepStream profiles.

The nvinferserver is just a Triton client. The CropSurfaceConverter and NetworkPreprocessor are both part of the preprocessing in nvinferserver; after preprocessing, the data is sent to the Triton server for inferencing, and nvinferserver waits for the response from the Triton server to get the inference output. All of this is done asynchronously by sharing buffers between nvinferserver and the Triton server; that is why you set "enable_cuda_buffer_sharing: true" in your configuration file. But the buffer pool shared between nvinferserver and the Triton server is limited in size. That means the Triton server inferencing latency (including gRPC latency) will impact the nvinferserver preprocessing latency: when all the buffers in the shared pool are held by the Triton server, no buffer is available in nvinferserver, and nvinferserver needs to wait until a buffer is returned to the pool.

From your Nsight data, at the beginning the NetworkPreprocessor runs right after the CropSurfaceConverter because the NetworkPreprocessor can get an output buffer from the buffer pool.


But after two batches, the NetworkPreprocessor can only start after "InferComplete", because there are only 2 buffers in the buffer pool; the 3rd buffer is only available after the Triton server releases a buffer back to the pool.

So the delay is decided by the Triton server's inferencing speed. It is normal to see a delay between CropSurfaceConverter and NetworkPreprocessor, because the Triton server inferencing delay is much longer than the preprocessing: the preprocessing of the next batches runs in parallel with the Triton server inferencing of the previous batches.
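This buffer-pool gating can be sketched as a small discrete-event model. All numbers here (pool of 2 buffers, 10 ms preprocessing, 40 ms inference) and the single-model-instance serialization are illustrative assumptions, not values taken from this pipeline:

```python
def simulate(pool_size, n_batches, preprocess_ms, infer_ms):
    """Toy timeline of nvinferserver preprocessing vs. Triton inference.

    NetworkPreprocessor for batch i needs a free buffer from the shared
    pool; that buffer is only released at InferComplete of batch
    i - pool_size. Inference is assumed serialized (one model instance).
    """
    pre_start, pre_end, infer_done = [], [], []
    for i in range(n_batches):
        t = pre_end[i - 1] if i else 0.0        # previous preprocess finished
        if i >= pool_size:                      # wait for a buffer to return
            t = max(t, infer_done[i - pool_size])
        pre_start.append(t)
        pre_end.append(t + preprocess_ms)
        prev = infer_done[i - 1] if i else 0.0  # server handles one at a time
        infer_done.append(max(pre_end[i], prev) + infer_ms)
    return pre_start, pre_end, infer_done

# Example: 2-buffer pool, 10 ms preprocessing, 40 ms inference.
starts, ends, done = simulate(pool_size=2, n_batches=8,
                              preprocess_ms=10.0, infer_ms=40.0)
print([s - e for s, e in zip(starts[1:], ends)])
# → [0.0, 30.0, 30.0, 30.0, 30.0, 30.0, 30.0]
```

In this toy run the first batches preprocess back to back, and from batch 3 on each NetworkPreprocessor start is gated by an InferComplete, so the steady gap (here 30 ms) tracks the inference period rather than the preprocessing speed, consistent with the ~23 ms gap observed from batch 3 onward.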

The root cause is your ensemble pipeline’s performance is not good enough.

Thanks for your reply. Are you talking about output_buffer_pool_size in the inference config? We didn't observe any change by setting output_buffer_pool_size to 6 or 12. Running perf_analyzer -m ensemble_yolox_s -i grpc --percentile 95 --async --streaming --shared-memory cuda --concurrency-range 1:4:1 -b 32 -v gives 35% more throughput compared to the DeepStream pipeline.

Initially, inference was running in parallel for 5 batches, but later it was only running for 4 batches in parallel. Which configuration setting is responsible for controlling this behavior?

The root cause is that your ensemble pipeline's performance is not good enough. It is no use changing output_buffer_pool_size; I just wanted to explain why the delay between CropSurfaceConverter and NetworkPreprocessor changes from batch 3 on. No matter what the output_buffer_pool_size value is, the delay will appear once all the buffers in the pool are occupied by the Triton server side.
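To illustrate why a larger pool does not change the steady state once inference is the bottleneck, here is a small sketch comparing steady-state batch periods for several pool sizes (all timings are made-up illustration values; inference is assumed serialized on one model instance):

```python
def steady_batch_period(pool_size, preprocess_ms, infer_ms, n=64):
    """Steady-state ms per batch in a toy buffer-pool pipeline model.

    Preprocessing of batch i waits for the buffer of batch i - pool_size
    to be released at InferComplete; inference is serialized.
    """
    pre_end, infer_done = [], []
    for i in range(n):
        t = pre_end[-1] if i else 0.0
        if i >= pool_size:                  # pool exhausted: wait for release
            t = max(t, infer_done[i - pool_size])
        pre_end.append(t + preprocess_ms)
        prev = infer_done[-1] if i else 0.0
        infer_done.append(max(pre_end[-1], prev) + infer_ms)
    # average period over the second half of the run
    return (infer_done[-1] - infer_done[n // 2]) / (n - n // 2 - 1)

for pool in (2, 6, 12):
    print(pool, steady_batch_period(pool, preprocess_ms=10.0, infer_ms=40.0))
# every pool size converges to the 40 ms inference period
```

Whatever the pool size, the steady-state period converges to the inference latency; a bigger pool only delays the point at which the wait first appears, which would match the observation that output_buffer_pool_size of 6 or 12 made no difference.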

What is your input to DeepStream pipeline? Local video file, live stream or others?


The input is a local video file. For the ensemble model, the throughput was higher when we measured it with perf_analyzer. We also observed higher throughput when the input file was resized, so I think preprocessing has some impact on throughput. How can we increase the throughput?


Can you try to set the "buffer-pool-size" property of nvstreammux to 10 and the "num-extra-surfaces" property of nvv4l2decoder to 6 in your pipeline?

I didn't observe any benefit from setting the "buffer-pool-size" property of nvstreammux and the "num-extra-surfaces" property of nvv4l2decoder.


Can you change the appsink to fakesink in your pipeline and set the "sync" property to FALSE on the fakesink to test the performance?

We are already testing throughput with "sync" set to FALSE. Replacing appsink with fakesink only resulted in a minor increase: throughput improved by just 5 inferences per second for a pipeline with 32 sources (files).
