YOLOv4 DS-Triton | Configuration specified max-batch 4 but TensorRT engine only supports max-batch 1

Environment

TensorRT Version: 7.2.1
NVIDIA GPU: T4
NVIDIA Driver Version: 450.51.06
CUDA Version: 11.1
CUDNN Version: 8.0.4
Operating System: Ubuntu 18.04
Python Version (if applicable): 1.8
Tensorflow Version (if applicable):
PyTorch Version (if applicable): container image nvcr.io/nvidia/pytorch:20.11-py3
Baremetal or Container (if so, version): container image deepstream:5.1-21.02-triton

Regarding the === Build and Inference Batch Options === in trtexec: what options should I use to build the engine with dynamic input shapes so that it can later be deployed with DS-Triton at BS > 1? Right now I am getting the error "TensorRT engine only supports max-batch 1" with DS.

See below:
Build the engine with dynamic shapes:
$ /usr/src/tensorrt/bin/trtexec --onnx=yolov4_-1_3_608_608_dynamic.onnx --explicitBatch --minShapes='input':1x3x608x608 --optShapes='input':4x3x608x608 --maxShapes='input':8x3x608x608 --workspace=4096 --saveEngine=yolov4_-1_3_608_608_dynamic_int8_.engine --int8
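
For reference, the same engine can also be exercised at a specific batch size within the optimization profile via trtexec's --shapes option (a sketch, reusing the engine name from the build command above):
$ /usr/src/tensorrt/bin/trtexec --loadEngine=yolov4_-1_3_608_608_dynamic_int8_.engine --shapes='input':4x3x608x608 --int8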

Run the inference with trtexec and the default batch size:
$ /usr/src/tensorrt/bin/trtexec --loadEngine=yolov4_-1_3_608_608_dynamic_onnx_int8_trtexec_4.engine --int8
Result:

[03/16/2021-00:23:14] [I] Host Latency
[03/16/2021-00:23:14] [I] min: 7.01904 ms (end to end 12.0889 ms)
[03/16/2021-00:23:14] [I] max: 7.89343 ms (end to end 13.8339 ms)
[03/16/2021-00:23:14] [I] mean: 7.15021 ms (end to end 12.3533 ms)
[03/16/2021-00:23:14] [I] median: 7.09982 ms (end to end 12.2517 ms)
[03/16/2021-00:23:14] [I] percentile: 7.88986 ms at 99% (end to end 13.818 ms at 99%)
[03/16/2021-00:23:14] [I] throughput: 160.912 qps
[03/16/2021-00:23:14] [I] walltime: 3.02029 s
[03/16/2021-00:23:14] [I] Enqueue Time
[03/16/2021-00:23:14] [I] min: 1.4646 ms
[03/16/2021-00:23:14] [I] max: 1.79004 ms
[03/16/2021-00:23:14] [I] median: 1.48828 ms
[03/16/2021-00:23:14] [I] GPU Compute
[03/16/2021-00:23:14] [I] min: 6.0675 ms
[03/16/2021-00:23:14] [I] max: 6.93729 ms
[03/16/2021-00:23:14] [I] mean: 6.1978 ms
[03/16/2021-00:23:14] [I] median: 6.14783 ms
[03/16/2021-00:23:14] [I] percentile: 6.9351 ms at 99%
[03/16/2021-00:23:14] [I] total compute time: 3.01213 s

Print the engine’s input and output shapes:

input shape :  (-1, 3, 608, 608)
out shape :  (-1, 22743, 1, 4)

Deploy the engine with DS

Run inference on DS with max_batch_size=1
$ deepstream-app -c source1_primary_yolov4.txt

I0316 01:15:35.232182 159 model_repository_manager.cc:810] loading: yolov4_nvidia:1
I0316 01:15:46.895954 159 plan_backend.cc:333] Creating instance yolov4_nvidia_0_0_gpu0 on GPU 0 (7.5) using yolov4_-1_3_608_608_dynamic_onnx_int8_trtexec_4.engine
I0316 01:15:47.333165 159 plan_backend.cc:666] Created instance yolov4_nvidia_0_0_gpu0 on GPU 0 with stream priority 0 and optimization profile default[0];
I0316 01:15:47.334265 159 model_repository_manager.cc:983] successfully loaded 'yolov4_nvidia' version 1
INFO: infer_trtis_backend.cpp:206 TrtISBackend id:1 initialized model: yolov4_nvidia

Runtime commands:
        h: Print this help
        q: Quit

        p: Pause
        r: Resume

NOTE: To expand a source in the 2D tiled display and view object details, left-click on the source.
      To go back to the tiled display, right-click anywhere on the window.


**PERF:  FPS 0 (Avg)
**PERF:  0.00 (0.00)
** INFO: <bus_callback:181>: Pipeline ready

** INFO: <bus_callback:167>: Pipeline running

**PERF:  138.28 (138.17)
**PERF:  141.00 (139.60)
** INFO: <bus_callback:204>: Received EOS. Exiting ...

Quitting
I0316 01:16:00.336337 159 model_repository_manager.cc:837] unloading: yolov4_nvidia:1
I0316 01:16:00.338973 159 server.cc:280] Waiting for in-flight requests to complete.
I0316 01:16:00.338986 159 server.cc:295] Timeout 30: Found 1 live models and 0 in-flight non-inference requests
I0316 01:16:00.378079 159 model_repository_manager.cc:966] successfully unloaded 'yolov4_nvidia' version 1
I0316 01:16:01.339052 159 server.cc:295] Timeout 29: Found 0 live models and 0 in-flight non-inference requests
App run successful

Run inference on DS with max_batch_size=4
$ deepstream-app -c source1_primary_yolov4.txt
Error:

E0316 01:20:12.879238 195 model_repository_manager.cc:1705] unable to autofill for 'yolov4_nvidia', configuration specified max-batch 4 but TensorRT engine only supports max-batch 1
ERROR: infer_trtis_server.cpp:1044 Triton: failed to load model yolov4_nvidia, triton_err_str:Internal, err_msg:failed to load 'yolov4_nvidia', no version is available
ERROR: infer_trtis_backend.cpp:45 failed to load model: yolov4_nvidia, nvinfer error:NVDSINFER_TRTIS_ERROR
ERROR: infer_trtis_backend.cpp:184 failed to initialize backend while ensuring model:yolov4_nvidia ready, nvinfer error:NVDSINFER_TRTIS_ERROR
0:00:14.484600140   195 0x56007e1c7cf0 ERROR          nvinferserver gstnvinferserver.cpp:362:gst_nvinfer_server_logger:<primary_gie> nvinferserver[UID 1]: Error in createNNBackend() <infer_trtis_context.cpp:246> [UID = 1]: failed to initialize trtis backend for model:yolov4_nvidia, nvinfer error:NVDSINFER_TRTIS_ERROR
I0316 01:20:12.879481 195 server.cc:280] Waiting for in-flight requests to complete.
I0316 01:20:12.879488 195 server.cc:295] Timeout 30: Found 0 live models and 0 in-flight non-inference requests
0:00:14.484704250   195 0x56007e1c7cf0 ERROR          nvinferserver gstnvinferserver.cpp:362:gst_nvinfer_server_logger:<primary_gie> nvinferserver[UID 1]: Error in initialize() <infer_base_context.cpp:81> [UID = 1]: create nn-backend failed, check config file settings, nvinfer error:NVDSINFER_TRTIS_ERROR
0:00:14.484716684   195 0x56007e1c7cf0 WARN           nvinferserver gstnvinferserver_impl.cpp:439:start:<primary_gie> error: Failed to initialize InferTrtIsContext
0:00:14.484722696   195 0x56007e1c7cf0 WARN           nvinferserver gstnvinferserver_impl.cpp:439:start:<primary_gie> error: Config file path: /workspace/Deepstream_5.1_Triton/samples/configs/deepstream-app-trtis/config_infer_primary_yolov4.txt
0:00:14.485106084   195 0x56007e1c7cf0 WARN           nvinferserver gstnvinferserver.cpp:460:gst_nvinfer_server_start:<primary_gie> error: gstnvinferserver_impl start failed
** ERROR: <main:655>: Failed to set pipeline to PAUSED
Quitting
ERROR from primary_gie: Failed to initialize InferTrtIsContext
Debug info: gstnvinferserver_impl.cpp(439): start (): /GstPipeline:pipeline/GstBin:primary_gie_bin/GstNvInferServer:primary_gie:
Config file path: /workspace/Deepstream_5.1_Triton/samples/configs/deepstream-app-trtis/config_infer_primary_yolov4.txt
ERROR from primary_gie: gstnvinferserver_impl start failed
Debug info: gstnvinferserver.cpp(460): gst_nvinfer_server_start (): /GstPipeline:pipeline/GstBin:primary_gie_bin/GstNvInferServer:primary_gie
App run failed

Hey, could you share your config files with us?

Hi @bcao, thanks for the quick answer. Please find the files enclosed, and below the polygraphy output of the INT8 models with dynamic and static input shapes:

Model with Dynamic Batching (min/opt/max batch = 1/4/8)
$ polygraphy inspect model yolov4_-1_3_608_608_dynamic_onnx_int8_trtexec_4.engine

[I] ==== TensorRT Engine ====
    Name: Unnamed Network 0 | Explicit Batch Engine (430 layers)

    ---- 1 Engine Inputs ----
    {input [dtype=float32, shape=(-1, 3, 608, 608)]}

    ---- 2 Engine Outputs ----
    {boxes [dtype=float32, shape=(-1, 22743, 1, 4)], confs [dtype=float32, shape=(-1, 22743, 80)]}

    ---- Memory ----
    Workspace Memory: 0 bytes

    ---- 1 Profiles (3 Bindings Each) ----
    - Profile: 0
        Binding Index: 0 (Input)  [Name: input] | Shapes: min=(1, 3, 608, 608), opt=(4, 3, 608, 608), max=(8, 3, 608, 608)
        Binding Index: 1 (Output) [Name: boxes] | Shape: (-1, 22743, 1, 4)
        Binding Index: 2 (Output) [Name: confs] | Shape: (-1, 22743, 80)

Model with Static Batching
Also, I have tried to deploy the model in DS-Triton with a static input shape (BS=64), but got the error failed to load 'yolov4_nvidia' version 1: Internal: trt failed to set binding dimension to [1,3,608,608] for input 'input' for yolov4_nvidia_0_0_gpu0. See its polygraphy output below:

$ polygraphy inspect model yolov4_64_3_608_608_static_onnx.engine

[I] ==== TensorRT Engine ====
    Name: Unnamed Network 0 | Explicit Batch Engine (431 layers)

    ---- 1 Engine Inputs ----
    {input [dtype=float32, shape=(64, 3, 608, 608)]}

    ---- 2 Engine Outputs ----
    {boxes [dtype=float32, shape=(64, 22743, 1, 4)], confs [dtype=float32, shape=(64, 22743, 80)]}

    ---- Memory ----
    Workspace Memory: 0 bytes

    ---- 1 Profiles (3 Bindings Each) ----
    - Profile: 0
        Binding Index: 0 (Input)  [Name: input] | Shapes: min=(64, 3, 608, 608), opt=(64, 3, 608, 608), max=(64, 3, 608, 608)
        Binding Index: 1 (Output) [Name: boxes] | Shape: (64, 22743, 1, 4)
        Binding Index: 2 (Output) [Name: confs] | Shape: (64, 22743, 80)

source1_primary_yolov4.txt (3.9 KB) config_infer_primary_yolov4.txt (1.3 KB) config.pbtxt (711 Bytes)

I need to deploy the optimized model in INT8 mode with DS-Triton to boost performance using dynamic batching and concurrency; right now the models (with static and dynamic input shapes) only work with BS=1 and concurrency=1.
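
For context, a minimal sketch of the config.pbtxt fragments I am using for dynamic batching and concurrency (values illustrative; see the attached config.pbtxt for the exact file):

dynamic_batching {
    preferred_batch_size: [ 1, 4 ]
    max_queue_delay_microseconds: 100
}
instance_group [
    {
      count: 2
      kind: KIND_GPU
    }
]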

Thanks, will check and update ASAP.

Hi @bcao, do you have any updates? Thanks!

Hey Virsg,
Sorry for the late reply. I checked your configs; could you also share your ONNX model with us and tell us which post-processor (NvDsInferParseCustomYoloV4) you are using?

Hi @bcao, you can download the static and dynamic ONNX models from here. Also, I am using the parser from the NVIDIA sample NVIDIA-AI-IOT/yolov4_deepstream. Please see below the steps used to build and configure it:

$ sudo git clone https://github.com/NVIDIA-AI-IOT/yolov4_deepstream.git
$ sudo cp -r "/yolov4_deepstream/deepstream_yolov4/" "/workspace/Deepstream_5.1_Triton/sources/"
$ cd /workspace/Deepstream_5.1_Triton/sources/deepstream_yolov4/nvdsinfer_custom_impl_Yolo
$ export CUDA_VER=11.1
$ make
# update the custom bbox parser settings in config_infer_primary_yolov4.txt as follows:

  postprocess {
    labelfile_path: "../../trtis_model_repo/yolov4_nvidia/labels.txt"
    detection {
      num_detected_classes: 80
      custom_parse_bbox_func: "NvDsInferParseCustomYoloV4"
      nms {
        confidence_threshold: 0.3
        iou_threshold: 0.6
        topk: 100
      }
    }
  }

  custom_lib {
    path: "/workspace/Deepstream_5.1_Triton/sources/deepstream_yolov4/nvdsinfer_custom_impl_Yolo/libnvdsinfer_custom_impl_Yolo.so"
  }

Hey customer, you need to change some config items to run the YOLOv4 engine file with a dynamic batch size. For config.pbtxt, you need to apply the following:

root@p4station:/home/bcao/customer-bug/triton-bs# diff config.pbtxt config.pbtxt.fix
6c6,7
< max_batch_size: 4
---
> max_batch_size: 0
>
8,11d8
< dynamic_batching {
<     preferred_batch_size: [ 1, 4 ]
<     max_queue_delay_microseconds: 100
< }
16,17c13
<     format: FORMAT_NCHW
<     dims: [ 3, 608, 608 ]
---
>     dims: [-1, 3, 608, 608 ]
24c20
<     dims: [ 22743, 1, 4 ]
---
>     dims: [-1, 22743, 1, 4 ]
29c25
<     dims: [ 22743, 80 ]
---
>     dims: [-1, 22743, 80 ]
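
Putting the diff together, a sketch of how the full fixed config.pbtxt could look (fields outside the diff, such as name, platform and default_model_filename, are assumptions; the exact file is attached below as config.pbtxt.fix):

name: "yolov4_nvidia"
platform: "tensorrt_plan"
max_batch_size: 0
default_model_filename: "yolov4_-1_3_608_608_dynamic_onnx_int8_trtexec_4.engine"
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ -1, 3, 608, 608 ]
  }
]
output [
  {
    name: "boxes"
    data_type: TYPE_FP32
    dims: [ -1, 22743, 1, 4 ]
  },
  {
    name: "confs"
    data_type: TYPE_FP32
    dims: [ -1, 22743, 80 ]
  }
]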

And for config_infer_primary_yolov4.txt, you need to change the tensor_order:

root@p4station:/home/bcao/customer-bug/triton-bs# diff config_infer_primary_yolov4.txt config_infer_primary_yolov4.txt.fix
20c20
<     tensor_order: TENSOR_ORDER_NONE
---
>     tensor_order: TENSOR_ORDER_LINEAR
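
For reference, tensor_order lives in the preprocess block of config_infer_primary_yolov4.txt; a sketch of that fragment (surrounding values are illustrative, see the attached fix file for the exact contents):

  preprocess {
    network_format: IMAGE_FORMAT_RGB
    tensor_order: TENSOR_ORDER_LINEAR
    maintain_aspect_ratio: 0
    normalize {
      scale_factor: 0.0039215697
    }
  }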

I have shared all the corrected configs with you; you can find the details inside them.
config.pbtxt.fix (598 Bytes) config_infer_primary_yolov4.txt.fix (1.3 KB) source4_primary_yolov4.txt (3.9 KB)

Hi @bcao, thanks for the feedback. I have run some experiments with the fixed config files you provided; however, the deployment didn't show a performance boost with BS>1 and count instances>1, and printed the warning W0324 19:24:44.954881 4238 autofill.cc:190] The specified dimensions in model config for yolov4_nvidia hints that batching is unavailable. Please see the table with the results below:

In my mind there are two possible root causes of the poor performance:

  • Upstream root cause: the tool used to optimize the ONNX model, trtexec
  • Downstream root cause: the post-processor (NvDsInferParseCustomYoloV4) and parser used from the NVIDIA sample NVIDIA-AI-IOT/yolov4_deepstream (which is based on DS 5.0)

Hey customer,
I have some questions about your table:
Is "Runtime inference batch size" the max_batch_size in config_infer_primary_yolov4.txt?
Is "Sources" the num-sources in source4_primary_yolov4.txt?
And what is "Count instances"?

BTW, for better performance you should make sure that num-sources == streammux batch-size (under [streammux] in source4_primary_yolov4.txt) == max_batch_size (in config_infer_primary_yolov4.txt).
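
A sketch of the matching fragments (values illustrative; check the attached files for the exact settings):

# source4_primary_yolov4.txt
[source0]
num-sources=4

[streammux]
batch-size=4

# config_infer_primary_yolov4.txt
infer_config {
  max_batch_size: 4
}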

Hi @bcao, thanks for the follow-up.

In my table:

  • Runtime inference batch size is the batch-size in source4_primary_yolov4.txt
  • Yes, Sources is the num-sources in source4_primary_yolov4.txt
  • Count instances is the count in the instance_group section of the config.pbtxt file
  • Yes, we are following the rule "num-sources == streammux batch-size (under [streammux] in source4_primary_yolov4.txt) == max_batch_size (in config_infer_primary_yolov4.txt)"
  1. Check nvidia-smi dmon while running batch 1 and batch 4 to see whether the workload is SM bound, and also check whether the clocks are maximized (see the sketch after this list).

  2. For trtexec, try implicit batch and increase the workspace size. Also note that INT8 does not mean all layers run in INT8; enabling INT8 only means layers have the option to be selected from INT8/FP16/FP32.

  3. If you think the (CPU) post-processing might slow down the perf, try disabling it all:

 postprocess {
    other {}
  }
  extra {
    copy_input_to_host_buffers: false
    output_buffer_pool_size: 6
  }
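
For point 1, a minimal sketch of watching SM utilization and clocks during the batch-1 and batch-4 runs (u = utilization, c = clocks, p = power/temperature; 1-second sampling interval):
$ nvidia-smi dmon -s puc -d 1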

In addition, per your first comment, the model perf with trtexec is 161 fps, but that only includes the inference stage. The perf measured by the DS pipeline includes pre/post-processing and inference, so you also need to check the perf using trtexec and see what the difference is; it is expected that the trtexec perf will be a little better than the perf with the DS pipeline.

Hi @bcao, thanks for the feedback. I have some extra questions:

  1. How can I retrieve the total latency (pre/post-processing and inference) when running the reference application deepstream-app with the DS-Triton pipeline?
  2. How can I check which layers were selected to run in INT8/FP16/FP32 after the optimization with TensorRT?
  3. What profiling tool is recommended for running the inference and extracting further hardware information: CPU utilization, CPU memory, GPU utilization, GPU memory?