Description
I’m looking to convert a YOLOv4 model from the ONNX Model Zoo into a TensorRT engine for use in DeepStream.
I want to run inference on streams from multiple sources, so I need to build the engine with a batch size > 1. I also have a question about the process:
- Do .engine files need to be created on the device they will run on? We plan to deploy on Jetson Xavier NX devices but are using an x86 build server, and this post (GitHub - isarsoft/yolov4-triton-tensorrt: This repository deploys YOLOv4 as an optimized TensorRT engine to Triton Inference Server) seems to imply you can build on an x86 server and push the engine out to a Jetson device.
Environment
TensorRT Version: 8.0.1 (v8001)
GPU Type: Jetson Xavier NX
Nvidia Driver Version: L4T r32.6.1
CUDA Version: 10.2
CUDNN Version:
Operating System + Version: Ubuntu 18.04
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag): r32.6.1-samples
Steps To Reproduce
Download the YOLOv4 ONNX file from the ONNX Model Zoo.
Run trtexec to build an engine for batch size > 1, following GitHub - isarsoft/yolov4-triton-tensorrt: This repository deploys YOLOv4 as an optimized TensorRT engine to Triton Inference Server (a sketch of the build command is below).
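For reference, the build step was roughly of this form (a sketch, not my exact command; the input tensor name "input" and the 3x608x608 shape are taken from the engine bindings reported in the log below, and liblayerplugin.so is the plugin library from the repo above):

/usr/src/tensorrt/bin/trtexec --onnx=yolov4.onnx --saveEngine=yolov4.engine --plugins=liblayerplugin.so --minShapes=input:1x3x608x608 --optShapes=input:4x3x608x608 --maxShapes=input:4x3x608x608 --workspace=4096

(My understanding is that the shape flags only take effect if the ONNX model’s batch dimension is dynamic; with a fixed batch-1 ONNX the resulting engine stays at max batch size 1.)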
Test with:
/usr/src/tensorrt/bin/trtexec --loadEngine=yolov4.engine --batch=4 --iterations=100 --avgRuns=10 --dumpProfile --dumpOutput --useCudaGraph
and I see the output:
/usr/src/tensorrt/bin/trtexec --loadEngine=yolov4.engine --batch=4 --iterations=100 --avgRuns=10 --dumpProfile --dumpOutput --useCudaGraph
&&&& RUNNING TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --loadEngine=yolov4.engine --batch=4 --iterations=100 --avgRuns=10 --dumpProfile --dumpOutput --useCudaGraph --plugins=liblayerplugin.so
[11/17/2023-11:56:18] [I] === Model Options ===
[11/17/2023-11:56:18] [I] Format: *
[11/17/2023-11:56:18] [I] Model:
[11/17/2023-11:56:18] [I] Output:
[11/17/2023-11:56:18] [I] === Build Options ===
[11/17/2023-11:56:18] [I] Max batch: 4
[11/17/2023-11:56:18] [I] Workspace: 16 MiB
[11/17/2023-11:56:18] [I] minTiming: 1
[11/17/2023-11:56:18] [I] avgTiming: 8
[11/17/2023-11:56:18] [I] Precision: FP32
[11/17/2023-11:56:18] [I] Calibration:
[11/17/2023-11:56:18] [I] Refit: Disabled
[11/17/2023-11:56:18] [I] Sparsity: Disabled
[11/17/2023-11:56:18] [I] Safe mode: Disabled
[11/17/2023-11:56:18] [I] Restricted mode: Disabled
[11/17/2023-11:56:18] [I] Save engine:
[11/17/2023-11:56:18] [I] Load engine: yolov4.engine
[11/17/2023-11:56:18] [I] NVTX verbosity: 0
[11/17/2023-11:56:18] [I] Tactic sources: Using default tactic sources
[11/17/2023-11:56:18] [I] timingCacheMode: local
[11/17/2023-11:56:18] [I] timingCacheFile:
[11/17/2023-11:56:18] [I] Input(s)s format: fp32:CHW
[11/17/2023-11:56:18] [I] Output(s)s format: fp32:CHW
[11/17/2023-11:56:18] [I] Input build shapes: model
[11/17/2023-11:56:18] [I] Input calibration shapes: model
[11/17/2023-11:56:18] [I] === System Options ===
[11/17/2023-11:56:18] [I] Device: 0
[11/17/2023-11:56:18] [I] DLACore:
[11/17/2023-11:56:18] [I] Plugins: liblayerplugin.so
[11/17/2023-11:56:18] [I] === Inference Options ===
[11/17/2023-11:56:18] [I] Batch: 4
[11/17/2023-11:56:18] [I] Input inference shapes: model
[11/17/2023-11:56:18] [I] Iterations: 100
[11/17/2023-11:56:18] [I] Duration: 3s (+ 200ms warm up)
[11/17/2023-11:56:18] [I] Sleep time: 0ms
[11/17/2023-11:56:18] [I] Streams: 1
[11/17/2023-11:56:18] [I] ExposeDMA: Disabled
[11/17/2023-11:56:18] [I] Data transfers: Enabled
[11/17/2023-11:56:18] [I] Spin-wait: Disabled
[11/17/2023-11:56:18] [I] Multithreading: Disabled
[11/17/2023-11:56:18] [I] CUDA Graph: Enabled
[11/17/2023-11:56:18] [I] Separate profiling: Disabled
[11/17/2023-11:56:18] [I] Time Deserialize: Disabled
[11/17/2023-11:56:18] [I] Time Refit: Disabled
[11/17/2023-11:56:18] [I] Skip inference: Disabled
[11/17/2023-11:56:18] [I] Inputs:
[11/17/2023-11:56:18] [I] === Reporting Options ===
[11/17/2023-11:56:18] [I] Verbose: Disabled
[11/17/2023-11:56:18] [I] Averages: 10 inferences
[11/17/2023-11:56:18] [I] Percentile: 99
[11/17/2023-11:56:18] [I] Dump refittable layers:Disabled
[11/17/2023-11:56:18] [I] Dump output: Enabled
[11/17/2023-11:56:18] [I] Profile: Enabled
[11/17/2023-11:56:18] [I] Export timing to JSON file:
[11/17/2023-11:56:18] [I] Export output to JSON file:
[11/17/2023-11:56:18] [I] Export profile to JSON file:
[11/17/2023-11:56:18] [I]
[11/17/2023-11:56:18] [I] === Device Information ===
[11/17/2023-11:56:18] [I] Selected Device: Xavier
[11/17/2023-11:56:18] [I] Compute Capability: 7.2
[11/17/2023-11:56:18] [I] SMs: 6
[11/17/2023-11:56:18] [I] Compute Clock Rate: 1.109 GHz
[11/17/2023-11:56:18] [I] Device Global Memory: 7765 MiB
[11/17/2023-11:56:18] [I] Shared Memory per SM: 96 KiB
[11/17/2023-11:56:18] [I] Memory Bus Width: 256 bits (ECC disabled)
[11/17/2023-11:56:18] [I] Memory Clock Rate: 1.109 GHz
[11/17/2023-11:56:18] [I]
[11/17/2023-11:56:18] [I] TensorRT version: 8001
[11/17/2023-11:56:18] [I] Loading supplied plugin library: liblayerplugin.so
[11/17/2023-11:56:20] [I] [TRT] [MemUsageChange] Init CUDA: CPU +354, GPU +0, now: CPU 505, GPU 5471 (MiB)
[11/17/2023-11:56:20] [I] [TRT] Loaded engine size: 133 MB
[11/17/2023-11:56:20] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 505 MiB, GPU 5471 MiB
[11/17/2023-11:56:23] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +223, GPU +287, now: CPU 743, GPU 5906 (MiB)
[11/17/2023-11:56:25] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +307, GPU +399, now: CPU 1050, GPU 6305 (MiB)
[11/17/2023-11:56:25] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1050, GPU 6292 (MiB)
[11/17/2023-11:56:25] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine end: CPU 1050 MiB, GPU 6292 MiB
[11/17/2023-11:56:25] [I] Engine loaded in 6.80832 sec.
[11/17/2023-11:56:25] [W] Profiler does not work when CUDA graph is enabled. Ignored --useCudaGraph flag and disabled CUDA graph.
[11/17/2023-11:56:25] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation begin: CPU 917 MiB, GPU 6158 MiB
[11/17/2023-11:56:25] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +4, now: CPU 917, GPU 6162 (MiB)
[11/17/2023-11:56:25] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 917, GPU 6172 (MiB)
[11/17/2023-11:56:25] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation end: CPU 920 MiB, GPU 6371 MiB
[11/17/2023-11:56:25] [I] Created input binding for input with dimensions 3x608x608
[11/17/2023-11:56:25] [I] Created output binding for detections with dimensions 159201x1x1
[11/17/2023-11:56:25] [I] Starting inference
[11/17/2023-11:56:25] [E] Error[3]: [executionContext.cpp::enqueue::276] Error Code 3: Internal Error (Parameter check failed at: runtime/api/executionContext.cpp::enqueue::276, condition: batchSize > 0 && batchSize <= mEngine.getMaxBatchSize(). Note: Batch size was: 4, but engine max batch size was: 1