Converting yolov4 onnx model to TensorRT for multi batch input

Description

I’m looking to convert the yolov4 model from the ONNX Model Zoo to a TensorRT engine for use in DeepStream.
I want to run streaming inference from multiple sources, so I need to convert it to use a batch size > 1. I also have a question about the process.

Environment

TensorRT Version: 8.0.1 (v8001)
GPU Type: Jetson Xavier NX
Nvidia Driver Version: r32.6.1
CUDA Version: 10.2
CUDNN Version:
Operating System + Version: Ubuntu 18.04
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag): r32.6.1-samples

Steps To Reproduce

Download yolov4 onnx file from onnx model zoo
Run trtexec with a batch size > 1 (following GitHub - isarsoft/yolov4-triton-tensorrt: This repository deploys YOLOv4 as an optimized TensorRT engine to Triton Inference Server)
Test with:
/usr/src/tensorrt/bin/trtexec --loadEngine=yolov4.engine --batch=4 --iterations=100 --avgRuns=10 --dumpProfile --dumpOutput --useCudaGraph

and I see the output:
/usr/src/tensorrt/bin/trtexec --loadEngine=yolov4.engine --batch=4 --iterations=100 --avgRuns=10 --dumpProfile --dumpOutput --useCudaGraph
&&&& RUNNING TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --loadEngine=yolov4.engine --batch=4 --iterations=100 --avgRuns=10 --dumpProfile --dumpOutput --useCudaGraph --plugins=liblayerplugin.so
[11/17/2023-11:56:18] [I] === Model Options ===
[11/17/2023-11:56:18] [I] Format: *
[11/17/2023-11:56:18] [I] Model:
[11/17/2023-11:56:18] [I] Output:
[11/17/2023-11:56:18] [I] === Build Options ===
[11/17/2023-11:56:18] [I] Max batch: 4
[11/17/2023-11:56:18] [I] Workspace: 16 MiB
[11/17/2023-11:56:18] [I] minTiming: 1
[11/17/2023-11:56:18] [I] avgTiming: 8
[11/17/2023-11:56:18] [I] Precision: FP32
[11/17/2023-11:56:18] [I] Calibration:
[11/17/2023-11:56:18] [I] Refit: Disabled
[11/17/2023-11:56:18] [I] Sparsity: Disabled
[11/17/2023-11:56:18] [I] Safe mode: Disabled
[11/17/2023-11:56:18] [I] Restricted mode: Disabled
[11/17/2023-11:56:18] [I] Save engine:
[11/17/2023-11:56:18] [I] Load engine: yolov4.engine
[11/17/2023-11:56:18] [I] NVTX verbosity: 0
[11/17/2023-11:56:18] [I] Tactic sources: Using default tactic sources
[11/17/2023-11:56:18] [I] timingCacheMode: local
[11/17/2023-11:56:18] [I] timingCacheFile:
[11/17/2023-11:56:18] [I] Input(s)s format: fp32:CHW
[11/17/2023-11:56:18] [I] Output(s)s format: fp32:CHW
[11/17/2023-11:56:18] [I] Input build shapes: model
[11/17/2023-11:56:18] [I] Input calibration shapes: model
[11/17/2023-11:56:18] [I] === System Options ===
[11/17/2023-11:56:18] [I] Device: 0
[11/17/2023-11:56:18] [I] DLACore:
[11/17/2023-11:56:18] [I] Plugins: liblayerplugin.so
[11/17/2023-11:56:18] [I] === Inference Options ===
[11/17/2023-11:56:18] [I] Batch: 4
[11/17/2023-11:56:18] [I] Input inference shapes: model
[11/17/2023-11:56:18] [I] Iterations: 100
[11/17/2023-11:56:18] [I] Duration: 3s (+ 200ms warm up)
[11/17/2023-11:56:18] [I] Sleep time: 0ms
[11/17/2023-11:56:18] [I] Streams: 1
[11/17/2023-11:56:18] [I] ExposeDMA: Disabled
[11/17/2023-11:56:18] [I] Data transfers: Enabled
[11/17/2023-11:56:18] [I] Spin-wait: Disabled
[11/17/2023-11:56:18] [I] Multithreading: Disabled
[11/17/2023-11:56:18] [I] CUDA Graph: Enabled
[11/17/2023-11:56:18] [I] Separate profiling: Disabled
[11/17/2023-11:56:18] [I] Time Deserialize: Disabled
[11/17/2023-11:56:18] [I] Time Refit: Disabled
[11/17/2023-11:56:18] [I] Skip inference: Disabled
[11/17/2023-11:56:18] [I] Inputs:
[11/17/2023-11:56:18] [I] === Reporting Options ===
[11/17/2023-11:56:18] [I] Verbose: Disabled
[11/17/2023-11:56:18] [I] Averages: 10 inferences
[11/17/2023-11:56:18] [I] Percentile: 99
[11/17/2023-11:56:18] [I] Dump refittable layers:Disabled
[11/17/2023-11:56:18] [I] Dump output: Enabled
[11/17/2023-11:56:18] [I] Profile: Enabled
[11/17/2023-11:56:18] [I] Export timing to JSON file:
[11/17/2023-11:56:18] [I] Export output to JSON file:
[11/17/2023-11:56:18] [I] Export profile to JSON file:
[11/17/2023-11:56:18] [I]
[11/17/2023-11:56:18] [I] === Device Information ===
[11/17/2023-11:56:18] [I] Selected Device: Xavier
[11/17/2023-11:56:18] [I] Compute Capability: 7.2
[11/17/2023-11:56:18] [I] SMs: 6
[11/17/2023-11:56:18] [I] Compute Clock Rate: 1.109 GHz
[11/17/2023-11:56:18] [I] Device Global Memory: 7765 MiB
[11/17/2023-11:56:18] [I] Shared Memory per SM: 96 KiB
[11/17/2023-11:56:18] [I] Memory Bus Width: 256 bits (ECC disabled)
[11/17/2023-11:56:18] [I] Memory Clock Rate: 1.109 GHz
[11/17/2023-11:56:18] [I]
[11/17/2023-11:56:18] [I] TensorRT version: 8001
[11/17/2023-11:56:18] [I] Loading supplied plugin library: liblayerplugin.so
[11/17/2023-11:56:20] [I] [TRT] [MemUsageChange] Init CUDA: CPU +354, GPU +0, now: CPU 505, GPU 5471 (MiB)
[11/17/2023-11:56:20] [I] [TRT] Loaded engine size: 133 MB
[11/17/2023-11:56:20] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 505 MiB, GPU 5471 MiB
[11/17/2023-11:56:23] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +223, GPU +287, now: CPU 743, GPU 5906 (MiB)
[11/17/2023-11:56:25] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +307, GPU +399, now: CPU 1050, GPU 6305 (MiB)
[11/17/2023-11:56:25] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1050, GPU 6292 (MiB)
[11/17/2023-11:56:25] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine end: CPU 1050 MiB, GPU 6292 MiB
[11/17/2023-11:56:25] [I] Engine loaded in 6.80832 sec.
[11/17/2023-11:56:25] [W] Profiler does not work when CUDA graph is enabled. Ignored --useCudaGraph flag and disabled CUDA graph.
[11/17/2023-11:56:25] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation begin: CPU 917 MiB, GPU 6158 MiB
[11/17/2023-11:56:25] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +4, now: CPU 917, GPU 6162 (MiB)
[11/17/2023-11:56:25] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 917, GPU 6172 (MiB)
[11/17/2023-11:56:25] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation end: CPU 920 MiB, GPU 6371 MiB
[11/17/2023-11:56:25] [I] Created input binding for input with dimensions 3x608x608
[11/17/2023-11:56:25] [I] Created output binding for detections with dimensions 159201x1x1
[11/17/2023-11:56:25] [I] Starting inference
[11/17/2023-11:56:25] [E] Error[3]: [executionContext.cpp::enqueue::276] Error Code 3: Internal Error (Parameter check failed at: runtime/api/executionContext.cpp::enqueue::276, condition: batchSize > 0 && batchSize <= mEngine.getMaxBatchSize(). Note: Batch size was: 4, but engine max batch size was: 1

Hi,
Please share the ONNX model and the script, if not shared already, so that we can assist you better.
In the meantime, you can try a few things:

  1. Validate your model with the snippet below

check_model.py

import sys
import onnx

# Usage: python3 check_model.py <path/to/model.onnx>
filename = sys.argv[1]
model = onnx.load(filename)
onnx.checker.check_model(model)  # raises an exception if the model is invalid
  2. Try building and running your model with the trtexec command (a rough example follows below).
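For a multi-batch engine, the ONNX model needs to be parsed in explicit-batch mode with an optimization profile that covers the batch sizes you want. A rough sketch, assuming the network input is named input with a 3x416x416 shape (adjust the input name, shapes, and file paths to your model):

/usr/src/tensorrt/bin/trtexec --onnx=yolov4.onnx --saveEngine=yolov4_dynamic.engine --minShapes=input:1x3x416x416 --optShapes=input:4x3x416x416 --maxShapes=input:8x3x416x416 --workspace=2048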

If you are still facing the issue, please share the trtexec --verbose log for further debugging.
Thanks!

The model is the yolov4 model from the ONNX Model Zoo: GitHub - onnx/models: A collection of pre-trained, state-of-the-art models in the ONNX format

Running the check_model script produces no output with onnx 1.13.1.
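For reference, I invoked the check along these lines (assuming the script takes the model path as its first argument):

python3 check_model.py /data/models/yolov4_onnxmodelzoo.onnx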

Running:
/usr/src/tensorrt/bin/trtexec --onnx=/data/models/yolov4_onnxmodelzoo.onnx --minShapes=input:1x3x416x416 --optShapes=input:16x3x416x416 --maxShapes=input:32x3x416x416 --shapes=input:5x3x416x416
gives:
&&&& PASSED TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --onnx=/data/models/yolov4_onnxmodelzoo.onnx --minShapes=input:1x3x416x416 --optShapes=input:16x3x416x416 --maxShapes=input:32x3x416x416 --shapes=input:5x3x416x416
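If I understand correctly, adding --saveEngine to the command above would persist the multi-batch engine, and a batch of 4 would then be requested at run time with --shapes rather than --batch, since the engine has an explicit batch dimension. Roughly (the engine file name is just illustrative):

/usr/src/tensorrt/bin/trtexec --loadEngine=yolov4_dynamic.engine --shapes=input:4x3x416x416 --iterations=100 --avgRuns=10 --dumpProfile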

Yes, the engine should be created on the same machine; however, the ONNX model can be imported.
I see that the model passed on the second run. Can you give a brief explanation of the error?

Thanks