YOLOv4 DS-Triton | Configuration specified max-batch 4 but TensorRT engine only supports max-batch 1

Environment

TensorRT Version: 7.2.1
NVIDIA GPU: T4
NVIDIA Driver Version: 450.51.06
CUDA Version: 11.1
CUDNN Version: 8.0.4
Operating System: Ubuntu 18.04
Python Version (if applicable): 1.8
Tensorflow Version (if applicable):
PyTorch Version (if applicable): container image nvcr.io/nvidia/pytorch:20.11-py3
Baremetal or Container (if so, version): container image deepstream:5.1-21.02-triton

Regarding the === Build and Inference Batch Options === in trtexec: what options should I use to build the engine with dynamic input shapes so that it can later be deployed with DS-Triton at BS > 1? Right now I am getting the error "TensorRT engine only supports max-batch 1" with DS.

See below:
Build the engine with dynamic shapes:
$ /usr/src/tensorrt/bin/trtexec --onnx=yolov4_-1_3_608_608_dynamic.onnx --explicitBatch --minShapes='input':1x3x608x608 --optShapes='input':4x3x608x608 --maxShapes='input':8x3x608x608 --workspace=4096 --saveEngine=yolov4_-1_3_608_608_dynamic_int8_.engine --int8
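
For reference, the same engine can also be exercised at a specific batch size within the optimization profile via trtexec's --shapes option (a sketch, reusing the engine name from the build command above):
$ /usr/src/tensorrt/bin/trtexec --loadEngine=yolov4_-1_3_608_608_dynamic_int8_.engine --shapes='input':4x3x608x608 --int8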

Run the inference with trtexec and the default batch size:
$ /usr/src/tensorrt/bin/trtexec --loadEngine=yolov4_-1_3_608_608_dynamic_onnx_int8_trtexec_4.engine --int8
Result:

[03/16/2021-00:23:14] [I] Host Latency
[03/16/2021-00:23:14] [I] min: 7.01904 ms (end to end 12.0889 ms)
[03/16/2021-00:23:14] [I] max: 7.89343 ms (end to end 13.8339 ms)
[03/16/2021-00:23:14] [I] mean: 7.15021 ms (end to end 12.3533 ms)
[03/16/2021-00:23:14] [I] median: 7.09982 ms (end to end 12.2517 ms)
[03/16/2021-00:23:14] [I] percentile: 7.88986 ms at 99% (end to end 13.818 ms at 99%)
[03/16/2021-00:23:14] [I] throughput: 160.912 qps
[03/16/2021-00:23:14] [I] walltime: 3.02029 s
[03/16/2021-00:23:14] [I] Enqueue Time
[03/16/2021-00:23:14] [I] min: 1.4646 ms
[03/16/2021-00:23:14] [I] max: 1.79004 ms
[03/16/2021-00:23:14] [I] median: 1.48828 ms
[03/16/2021-00:23:14] [I] GPU Compute
[03/16/2021-00:23:14] [I] min: 6.0675 ms
[03/16/2021-00:23:14] [I] max: 6.93729 ms
[03/16/2021-00:23:14] [I] mean: 6.1978 ms
[03/16/2021-00:23:14] [I] median: 6.14783 ms
[03/16/2021-00:23:14] [I] percentile: 6.9351 ms at 99%
[03/16/2021-00:23:14] [I] total compute time: 3.01213 s

Print the engine’s input and output shapes:

input shape :  (-1, 3, 608, 608)
out shape :  (-1, 22743, 1, 4)

Deploy the engine with DS

Run inference on DS with max_batch_size=1
$ deepstream-app -c source1_primary_yolov4.txt

I0316 01:15:35.232182 159 model_repository_manager.cc:810] loading: yolov4_nvidia:1
I0316 01:15:46.895954 159 plan_backend.cc:333] Creating instance yolov4_nvidia_0_0_gpu0 on GPU 0 (7.5) using yolov4_-1_3_608_608_dynamic_onnx_int8_trtexec_4.engine
I0316 01:15:47.333165 159 plan_backend.cc:666] Created instance yolov4_nvidia_0_0_gpu0 on GPU 0 with stream priority 0 and optimization profile default[0];
I0316 01:15:47.334265 159 model_repository_manager.cc:983] successfully loaded 'yolov4_nvidia' version 1
INFO: infer_trtis_backend.cpp:206 TrtISBackend id:1 initialized model: yolov4_nvidia

Runtime commands:
        h: Print this help
        q: Quit

        p: Pause
        r: Resume

NOTE: To expand a source in the 2D tiled display and view object details, left-click on the source.
      To go back to the tiled display, right-click anywhere on the window.


**PERF:  FPS 0 (Avg)
**PERF:  0.00 (0.00)
** INFO: <bus_callback:181>: Pipeline ready

** INFO: <bus_callback:167>: Pipeline running

**PERF:  138.28 (138.17)
**PERF:  141.00 (139.60)
** INFO: <bus_callback:204>: Received EOS. Exiting ...

Quitting
I0316 01:16:00.336337 159 model_repository_manager.cc:837] unloading: yolov4_nvidia:1
I0316 01:16:00.338973 159 server.cc:280] Waiting for in-flight requests to complete.
I0316 01:16:00.338986 159 server.cc:295] Timeout 30: Found 1 live models and 0 in-flight non-inference requests
I0316 01:16:00.378079 159 model_repository_manager.cc:966] successfully unloaded 'yolov4_nvidia' version 1
I0316 01:16:01.339052 159 server.cc:295] Timeout 29: Found 0 live models and 0 in-flight non-inference requests
App run successful

Run inference on DS with max_batch_size=4
$ deepstream-app -c source1_primary_yolov4.txt
Error:

E0316 01:20:12.879238 195 model_repository_manager.cc:1705] unable to autofill for 'yolov4_nvidia', configuration specified max-batch 4 but TensorRT engine only supports max-batch 1
ERROR: infer_trtis_server.cpp:1044 Triton: failed to load model yolov4_nvidia, triton_err_str:Internal, err_msg:failed to load 'yolov4_nvidia', no version is available
ERROR: infer_trtis_backend.cpp:45 failed to load model: yolov4_nvidia, nvinfer error:NVDSINFER_TRTIS_ERROR
ERROR: infer_trtis_backend.cpp:184 failed to initialize backend while ensuring model:yolov4_nvidia ready, nvinfer error:NVDSINFER_TRTIS_ERROR
0:00:14.484600140   195 0x56007e1c7cf0 ERROR          nvinferserver gstnvinferserver.cpp:362:gst_nvinfer_server_logger:<primary_gie> nvinferserver[UID 1]: Error in createNNBackend() <infer_trtis_context.cpp:246> [UID = 1]: failed to initialize trtis backend for model:yolov4_nvidia, nvinfer error:NVDSINFER_TRTIS_ERROR
I0316 01:20:12.879481 195 server.cc:280] Waiting for in-flight requests to complete.
I0316 01:20:12.879488 195 server.cc:295] Timeout 30: Found 0 live models and 0 in-flight non-inference requests
0:00:14.484704250   195 0x56007e1c7cf0 ERROR          nvinferserver gstnvinferserver.cpp:362:gst_nvinfer_server_logger:<primary_gie> nvinferserver[UID 1]: Error in initialize() <infer_base_context.cpp:81> [UID = 1]: create nn-backend failed, check config file settings, nvinfer error:NVDSINFER_TRTIS_ERROR
0:00:14.484716684   195 0x56007e1c7cf0 WARN           nvinferserver gstnvinferserver_impl.cpp:439:start:<primary_gie> error: Failed to initialize InferTrtIsContext
0:00:14.484722696   195 0x56007e1c7cf0 WARN           nvinferserver gstnvinferserver_impl.cpp:439:start:<primary_gie> error: Config file path: /workspace/Deepstream_5.1_Triton/samples/configs/deepstream-app-trtis/config_infer_primary_yolov4.txt
0:00:14.485106084   195 0x56007e1c7cf0 WARN           nvinferserver gstnvinferserver.cpp:460:gst_nvinfer_server_start:<primary_gie> error: gstnvinferserver_impl start failed
** ERROR: <main:655>: Failed to set pipeline to PAUSED
Quitting
ERROR from primary_gie: Failed to initialize InferTrtIsContext
Debug info: gstnvinferserver_impl.cpp(439): start (): /GstPipeline:pipeline/GstBin:primary_gie_bin/GstNvInferServer:primary_gie:
Config file path: /workspace/Deepstream_5.1_Triton/samples/configs/deepstream-app-trtis/config_infer_primary_yolov4.txt
ERROR from primary_gie: gstnvinferserver_impl start failed
Debug info: gstnvinferserver.cpp(460): gst_nvinfer_server_start (): /GstPipeline:pipeline/GstBin:primary_gie_bin/GstNvInferServer:primary_gie
App run failed

Hey, could you share your config files with us?

Hi @bcao, thanks for the quick answer. Please find the files enclosed, and below the polygraphy output of the INT8 models with dynamic and static input shapes:

Model with Dynamic Batching (min/opt/max batch = 1/4/8)
$ polygraphy inspect model yolov4_-1_3_608_608_dynamic_onnx_int8_trtexec_4.engine

[I] ==== TensorRT Engine ====
    Name: Unnamed Network 0 | Explicit Batch Engine (430 layers)

    ---- 1 Engine Inputs ----
    {input [dtype=float32, shape=(-1, 3, 608, 608)]}

    ---- 2 Engine Outputs ----
    {boxes [dtype=float32, shape=(-1, 22743, 1, 4)], confs [dtype=float32, shape=(-1, 22743, 80)]}

    ---- Memory ----
    Workspace Memory: 0 bytes

    ---- 1 Profiles (3 Bindings Each) ----
    - Profile: 0
        Binding Index: 0 (Input)  [Name: input] | Shapes: min=(1, 3, 608, 608), opt=(4, 3, 608, 608), max=(8, 3, 608, 608)
        Binding Index: 1 (Output) [Name: boxes] | Shape: (-1, 22743, 1, 4)
        Binding Index: 2 (Output) [Name: confs] | Shape: (-1, 22743, 80)

Model with Static Batching
Also, I have tried to deploy the model in DS-Triton with a static input shape (BS=64), but got the error failed to load 'yolov4_nvidia' version 1: Internal: trt failed to set binding dimension to [1,3,608,608] for input 'input' for yolov4_nvidia_0_0_gpu0. See its polygraphy output below:

$ polygraphy inspect model yolov4_64_3_608_608_static_onnx.engine

[I] ==== TensorRT Engine ====
    Name: Unnamed Network 0 | Explicit Batch Engine (431 layers)

    ---- 1 Engine Inputs ----
    {input [dtype=float32, shape=(64, 3, 608, 608)]}

    ---- 2 Engine Outputs ----
    {boxes [dtype=float32, shape=(64, 22743, 1, 4)], confs [dtype=float32, shape=(64, 22743, 80)]}

    ---- Memory ----
    Workspace Memory: 0 bytes

    ---- 1 Profiles (3 Bindings Each) ----
    - Profile: 0
        Binding Index: 0 (Input)  [Name: input] | Shapes: min=(64, 3, 608, 608), opt=(64, 3, 608, 608), max=(64, 3, 608, 608)
        Binding Index: 1 (Output) [Name: boxes] | Shape: (64, 22743, 1, 4)
        Binding Index: 2 (Output) [Name: confs] | Shape: (64, 22743, 80)

source1_primary_yolov4.txt (3.9 KB) config_infer_primary_yolov4.txt (1.3 KB) config.pbtxt (711 Bytes)

I need to deploy the optimized model in INT8 mode with DS-Triton to boost performance using dynamic batching and concurrency; right now the models (with static and dynamic input shapes) only work with BS=1 and concurrency=1.
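
For context, a minimal sketch of the config.pbtxt fragments I am using for dynamic batching and concurrency (values illustrative; see the attached config.pbtxt for the exact file):

dynamic_batching {
    preferred_batch_size: [ 1, 4 ]
    max_queue_delay_microseconds: 100
}
instance_group [
    {
      count: 2
      kind: KIND_GPU
    }
]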

Thanks, will check and update ASAP.

Hi @bcao, do you have any updates? Thanks!

Hey Virsg,
Sorry for the late reply. I checked your configs; could you also share your ONNX model with us and tell us which post-processor (NvDsInferParseCustomYoloV4) you are using?

Hi @bcao, you can download the static and dynamic ONNX models from here. Also, I am using the parser from the NVIDIA sample NVIDIA-AI-IOT/yolov4_deepstream. Please see below the steps used to build and configure it:

$ sudo git clone https://github.com/NVIDIA-AI-IOT/yolov4_deepstream.git
$ sudo cp -r "/yolov4_deepstream/deepstream_yolov4/" "/workspace/Deepstream_5.1_Triton/sources/"
$ cd /workspace/Deepstream_5.1_Triton/sources/deepstream_yolov4/nvdsinfer_custom_impl_Yolo
$ export CUDA_VER=11.1
$ make
# update the custom bbox parser settings in config_infer_primary_yolov4.txt as follows:

  postprocess {
    labelfile_path: "../../trtis_model_repo/yolov4_nvidia/labels.txt"
    detection {
      num_detected_classes: 80
      custom_parse_bbox_func: "NvDsInferParseCustomYoloV4"
      nms {
        confidence_threshold: 0.3
        iou_threshold: 0.6
        topk: 100
      }
    }
  }

  custom_lib {
    path: "/workspace/Deepstream_5.1_Triton/sources/deepstream_yolov4/nvdsinfer_custom_impl_Yolo/libnvdsinfer_custom_impl_Yolo.so"
  }

Hey customer, you need to change some config items to run the YOLOv4 engine file with a dynamic batch size. For config.pbtxt, you need to apply the following:

root@p4station:/home/bcao/customer-bug/triton-bs# diff config.pbtxt config.pbtxt.fix
6c6,7
< max_batch_size: 4
---
> max_batch_size: 0
>
8,11d8
< dynamic_batching {
<     preferred_batch_size: [ 1, 4 ]
<     max_queue_delay_microseconds: 100
< }
16,17c13
<     format: FORMAT_NCHW
<     dims: [ 3, 608, 608 ]
---
>     dims: [-1, 3, 608, 608 ]
24c20
<     dims: [ 22743, 1, 4 ]
---
>     dims: [-1, 22743, 1, 4 ]
29c25
<     dims: [ 22743, 80 ]
---
>     dims: [-1, 22743, 80 ]
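
Putting the diff together, a sketch of how the full fixed config.pbtxt could look (fields outside the diff, such as name, platform and default_model_filename, are assumptions; the exact file is attached below as config.pbtxt.fix):

name: "yolov4_nvidia"
platform: "tensorrt_plan"
max_batch_size: 0
default_model_filename: "yolov4_-1_3_608_608_dynamic_onnx_int8_trtexec_4.engine"
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ -1, 3, 608, 608 ]
  }
]
output [
  {
    name: "boxes"
    data_type: TYPE_FP32
    dims: [ -1, 22743, 1, 4 ]
  },
  {
    name: "confs"
    data_type: TYPE_FP32
    dims: [ -1, 22743, 80 ]
  }
]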

And for config_infer_primary_yolov4.txt, you need to change the tensor_order:

root@p4station:/home/bcao/customer-bug/triton-bs# diff config_infer_primary_yolov4.txt config_infer_primary_yolov4.txt.fix
20c20
<     tensor_order: TENSOR_ORDER_NONE
---
>     tensor_order: TENSOR_ORDER_LINEAR
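
For reference, tensor_order lives in the preprocess block of config_infer_primary_yolov4.txt; a sketch of that fragment (surrounding values are illustrative, see the attached fix file for the exact contents):

  preprocess {
    network_format: IMAGE_FORMAT_RGB
    tensor_order: TENSOR_ORDER_LINEAR
    maintain_aspect_ratio: 0
    normalize {
      scale_factor: 0.0039215697
    }
  }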

I have shared all the corrected configs with you; you can find the details inside them.
config.pbtxt.fix (598 Bytes) config_infer_primary_yolov4.txt.fix (1.3 KB) source4_primary_yolov4.txt (3.9 KB)

Hi @bcao, thanks for the feedback. I have run some experiments with the fixed config files you provided; however, the deployment didn't show a performance boost with BS>1 and count instances>1, and printed the warning W0324 19:24:44.954881 4238 autofill.cc:190] The specified dimensions in model config for yolov4_nvidia hints that batching is unavailable. Please see the table with the results below:

In my mind there are two possible root causes of the poor performance:

  • Upstream root cause: the tool used to optimize the ONNX model, trtexec
  • Downstream root cause: the post-processor (NvDsInferParseCustomYoloV4) and parser used from the NVIDIA sample NVIDIA-AI-IOT/yolov4_deepstream (which is based on DS 5.0)

Hey customer,
I have some questions about your table:
Is "Runtime inference batch size" the max_batch_size in config_infer_primary_yolov4.txt?
Is "Sources" the num-sources in source4_primary_yolov4.txt?
And what is "Count instances"?

BTW, for better performance you should make sure that num-sources == streammux batch-size (under [streammux] in source4_primary_yolov4.txt) == max_batch_size (in config_infer_primary_yolov4.txt).
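
A sketch of the matching fragments (values illustrative; check the attached files for the exact settings):

# source4_primary_yolov4.txt
[source0]
num-sources=4

[streammux]
batch-size=4

# config_infer_primary_yolov4.txt
infer_config {
  max_batch_size: 4
}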

Hi @bcao, thanks for the follow-up.

In my table:

  • Runtime inference batch size is the batch-size in source4_primary_yolov4.txt
  • Yes, Sources is the num-sources in source4_primary_yolov4.txt
  • Count instances is the count in the instance_group section of the config.pbtxt file
  • Yes, we are following the rule "num-sources == streammux batch-size (under [streammux] in source4_primary_yolov4.txt) == max_batch_size (in config_infer_primary_yolov4.txt)"
  1. Check nvidia-smi dmon while running batch 1 and batch 4 to see whether the workload is SM bound, and also check whether the clocks are maximized (see the sketch after this list).

  2. For trtexec, try implicit batch and increase the workspace size. Also note that INT8 does not mean all layers run in INT8; enabling INT8 only means layers have the option to be selected from INT8/FP16/FP32.

  3. If you think the (CPU) post-processing might slow down the perf, try disabling it all:

 postprocess {
    other {}
  }
  extra {
    copy_input_to_host_buffers: false
    output_buffer_pool_size: 6
  }
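
For point 1, a minimal sketch of watching SM utilization and clocks during the batch-1 and batch-4 runs (u = utilization, c = clocks, p = power/temperature; 1-second sampling interval):
$ nvidia-smi dmon -s puc -d 1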

In addition, per your first comment, the model perf with trtexec is 161 fps, but that only includes the inference stage. The perf measured by the DS pipeline includes pre/post-processing and inference, so you also need to check the perf using trtexec and see what the difference is; it is expected that the trtexec perf will be a little better than the perf with the DS pipeline.

Hi @bcao, thanks for the feedback. I have some extra questions:

  1. How can I retrieve the total latency (pre/post-processing and inference) when running the reference application deepstream-app with the DS-Triton pipeline?
  2. How can I check which layers were selected to run in INT8/FP16/FP32 after the optimization with TensorRT?
  3. What profiling tool is recommended for running the inference and extracting further hardware information: CPU utilization, CPU memory, GPU utilization, GPU memory?