TensorRT inference with batch > 1

Description

I want to run TensorRT inference with batching. Please look at simswapRuntrt2.py below.
In inference_engine(), calling trt_context.execute_async(batch_size=4, bindings=bindings, stream_handle=stream.handle) makes every result 0 and produces this error:

[07/26/2022-08:09:43] [TRT] [E] 3: [executionContext.cpp::enqueue::284] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::enqueue::284, condition: batchSize > 0 && batchSize <= mEngine.getMaxBatchSize(). Note: Batch size was: 4, but engine max batch size was: 1
)
Every result should be 78592.41, but instead I get:
0.0
0.0
0.0
0.0

Also, trt_context.execute_async_v2(bindings=bindings, stream_handle=stream.handle) runs without error but produces quite different results.

Again, every result should be 78592.41, but I get:
108099.29
76659.984
67055.984
56882.566

I suspect the cause is in get_engine(), or that I made a mistake when converting the model to TensorRT.
Please see my simswapRuntrt2.py code together with the model.

How I converted the model:

/workspace/tensorrt/bin/trtexec --explicitBatch --onnx=/workspace/simswap2trt/2trt/dynamic_folded.onnx --minShapes=x1:4x3x224x224,x2:4x512 --optShapes=x1:4x3x224x224,x2:4x512 --maxShapes=x1:4x3x224x224,x2:4x512 --saveEngine=/workspace/simswap2trt/2trt/dynamic_batch.plan
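For reference, this is the explicit-batch inference path as I understand it must look: with an --explicitBatch engine, the batch size travels inside the binding shapes, so the shapes have to be set on the context before buffers are sized, and inference has to go through execute_async_v2(). A minimal sketch (the buffer handling and pycuda usage here are my assumptions, not the exact code of simswapRuntrt2.py):

```python
import numpy as np

def volume(shape):
    # Total element count of a fully specified binding shape.
    return int(np.prod(shape))

def infer_batch(engine_path, x1, x2):
    """Run one explicit-batch inference: x1 is (N, 3, 224, 224), x2 is (N, 512)."""
    # Imported here so volume() above stays usable without a GPU stack.
    import tensorrt as trt
    import pycuda.autoinit  # noqa: F401 -- creates the CUDA context
    import pycuda.driver as cuda

    logger = trt.Logger(trt.Logger.WARNING)
    with open(engine_path, "rb") as f, trt.Runtime(logger) as runtime:
        engine = runtime.deserialize_cuda_engine(f.read())
    context = engine.create_execution_context()

    # Explicit batch: set the concrete input shapes (batch included)
    # on the context before sizing any buffers.
    context.set_binding_shape(engine.get_binding_index("x1"), x1.shape)
    context.set_binding_shape(engine.get_binding_index("x2"), x2.shape)

    stream = cuda.Stream()
    inputs = {"x1": x1, "x2": x2}
    bindings, outputs = [], []
    for i in range(engine.num_bindings):
        shape = tuple(context.get_binding_shape(i))  # resolved, batch included
        host = np.empty(volume(shape), dtype=np.float32)
        dev = cuda.mem_alloc(host.nbytes)
        bindings.append(int(dev))
        if engine.binding_is_input(i):
            src = np.ascontiguousarray(inputs[engine.get_binding_name(i)],
                                       dtype=np.float32)
            cuda.memcpy_htod_async(dev, src, stream)
        else:
            outputs.append((host, dev, shape))

    # v2 API: no batch_size argument; the binding shapes carry the batch.
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    for host, dev, _ in outputs:
        cuda.memcpy_dtoh_async(host, dev, stream)
    stream.synchronize()
    return [host.reshape(shape) for host, _, shape in outputs]
```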

Thank you ;)

Environment

TensorRT Version: 8.2.4
GPU Type: RTX 2080 Ti
Nvidia Driver Version: 470.57.02
CUDA Version: 11.4
CUDNN Version: 8.4.0
Operating System + Version: Ubuntu 20.04
Python Version (if applicable): Python 3.8.10
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.12.0+cu116
Baremetal or Container (if container which image + tag): nvcr.io/nvidia/tensorrt:22.04-py3

Relevant Files


simswapRuntrt2.py (4.3 KB)

latend_input.npy - Google Drive, swap result0.npy - Google Drive, b_align_crop_tensor_input0.npy - Google Drive

Steps To Reproduce


Hi,

We request that you share the model, script, profiler, and performance output (if not already shared) so that we can help you better.

Alternatively, you can try running your model with the trtexec command.

While measuring model performance, make sure you consider the latency and throughput of the network inference, excluding the data pre- and post-processing overhead.
Please refer to the links below for more details:
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-803/best-practices/index.html#measure-performance

https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-803/best-practices/index.html#model-accuracy

Thanks!

All the other files are uploaded!
Here are the trtexec results:

  1. trtexec --loadEngine=dynamic_batch.plan --batch=4 => failed
  2. trtexec --loadEngine=dynamic_batch.plan --shapes=x1:4x3x224x224,x2:4x512 => passed
root@9dd4dce9103b:/workspace/simswap2trt/2trt# /workspace/tensorrt/bin/trtexec --loadEngine=dynamic_batch.plan --batch=4
&&&& RUNNING TensorRT.trtexec [TensorRT v8204] # /workspace/tensorrt/bin/trtexec --loadEngine=dynamic_batch.plan --batch=4
[07/26/2022-08:43:28] [I] === Model Options ===
[07/26/2022-08:43:28] [I] Format: *
[07/26/2022-08:43:28] [I] Model: 
[07/26/2022-08:43:28] [I] Output:
[07/26/2022-08:43:28] [I] === Build Options ===
[07/26/2022-08:43:28] [I] Max batch: 4
[07/26/2022-08:43:28] [I] Workspace: 16 MiB
[07/26/2022-08:43:28] [I] minTiming: 1
[07/26/2022-08:43:28] [I] avgTiming: 8
[07/26/2022-08:43:28] [I] Precision: FP32
[07/26/2022-08:43:28] [I] Calibration: 
[07/26/2022-08:43:28] [I] Refit: Disabled
[07/26/2022-08:43:28] [I] Sparsity: Disabled
[07/26/2022-08:43:28] [I] Safe mode: Disabled
[07/26/2022-08:43:28] [I] DirectIO mode: Disabled
[07/26/2022-08:43:28] [I] Restricted mode: Disabled
[07/26/2022-08:43:28] [I] Save engine: 
[07/26/2022-08:43:28] [I] Load engine: dynamic_batch.plan
[07/26/2022-08:43:28] [I] Profiling verbosity: 0
[07/26/2022-08:43:28] [I] Tactic sources: Using default tactic sources
[07/26/2022-08:43:28] [I] timingCacheMode: local
[07/26/2022-08:43:28] [I] timingCacheFile: 
[07/26/2022-08:43:28] [I] Input(s)s format: fp32:CHW
[07/26/2022-08:43:28] [I] Output(s)s format: fp32:CHW
[07/26/2022-08:43:28] [I] Input build shapes: model
[07/26/2022-08:43:28] [I] Input calibration shapes: model
[07/26/2022-08:43:28] [I] === System Options ===
[07/26/2022-08:43:28] [I] Device: 0
[07/26/2022-08:43:28] [I] DLACore: 
[07/26/2022-08:43:28] [I] Plugins:
[07/26/2022-08:43:28] [I] === Inference Options ===
[07/26/2022-08:43:28] [I] Batch: 4
[07/26/2022-08:43:28] [I] Input inference shapes: model
[07/26/2022-08:43:28] [I] Iterations: 10
[07/26/2022-08:43:28] [I] Duration: 3s (+ 200ms warm up)
[07/26/2022-08:43:28] [I] Sleep time: 0ms
[07/26/2022-08:43:28] [I] Idle time: 0ms
[07/26/2022-08:43:28] [I] Streams: 1
[07/26/2022-08:43:28] [I] ExposeDMA: Disabled
[07/26/2022-08:43:28] [I] Data transfers: Enabled
[07/26/2022-08:43:28] [I] Spin-wait: Disabled
[07/26/2022-08:43:28] [I] Multithreading: Disabled
[07/26/2022-08:43:28] [I] CUDA Graph: Disabled
[07/26/2022-08:43:28] [I] Separate profiling: Disabled
[07/26/2022-08:43:28] [I] Time Deserialize: Disabled
[07/26/2022-08:43:28] [I] Time Refit: Disabled
[07/26/2022-08:43:28] [I] Skip inference: Disabled
[07/26/2022-08:43:28] [I] Inputs:
[07/26/2022-08:43:28] [I] === Reporting Options ===
[07/26/2022-08:43:28] [I] Verbose: Disabled
[07/26/2022-08:43:28] [I] Averages: 10 inferences
[07/26/2022-08:43:28] [I] Percentile: 99
[07/26/2022-08:43:28] [I] Dump refittable layers:Disabled
[07/26/2022-08:43:28] [I] Dump output: Disabled
[07/26/2022-08:43:28] [I] Profile: Disabled
[07/26/2022-08:43:28] [I] Export timing to JSON file: 
[07/26/2022-08:43:28] [I] Export output to JSON file: 
[07/26/2022-08:43:28] [I] Export profile to JSON file: 
[07/26/2022-08:43:28] [I] 
[07/26/2022-08:43:28] [I] === Device Information ===
[07/26/2022-08:43:28] [I] Selected Device: NVIDIA GeForce RTX 2080 Ti
[07/26/2022-08:43:28] [I] Compute Capability: 7.5
[07/26/2022-08:43:28] [I] SMs: 68
[07/26/2022-08:43:28] [I] Compute Clock Rate: 1.65 GHz
[07/26/2022-08:43:28] [I] Device Global Memory: 11011 MiB
[07/26/2022-08:43:28] [I] Shared Memory per SM: 64 KiB
[07/26/2022-08:43:28] [I] Memory Bus Width: 352 bits (ECC disabled)
[07/26/2022-08:43:28] [I] Memory Clock Rate: 7 GHz
[07/26/2022-08:43:28] [I] 
[07/26/2022-08:43:28] [I] TensorRT version: 8.2.4
[07/26/2022-08:43:28] [I] [TRT] [MemUsageChange] Init CUDA: CPU +321, GPU +0, now: CPU 879, GPU 870 (MiB)
[07/26/2022-08:43:28] [I] [TRT] Loaded engine size: 545 MiB
[07/26/2022-08:43:29] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +513, GPU +222, now: CPU 1431, GPU 1604 (MiB)
[07/26/2022-08:43:29] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +116, GPU +54, now: CPU 1547, GPU 1658 (MiB)
[07/26/2022-08:43:29] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +508, now: CPU 0, GPU 508 (MiB)
[07/26/2022-08:43:29] [I] Engine loaded in 0.994565 sec.
[07/26/2022-08:43:29] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 1001, GPU 1650 (MiB)
[07/26/2022-08:43:29] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1001, GPU 1658 (MiB)
[07/26/2022-08:43:29] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +627, now: CPU 0, GPU 1135 (MiB)
[07/26/2022-08:43:29] [I] Using random values for input x1
[07/26/2022-08:43:29] [I] Created input binding for x1 with dimensions 4x3x224x224
[07/26/2022-08:43:29] [I] Using random values for input x2
[07/26/2022-08:43:29] [I] Created input binding for x2 with dimensions 4x512
[07/26/2022-08:43:29] [I] Using random values for output outputs
[07/26/2022-08:43:29] [I] Created output binding for outputs with dimensions 4x3x224x224
[07/26/2022-08:43:29] [I] Starting inference
[07/26/2022-08:43:29] [E] Error[3]: [executionContext.cpp::enqueue::284] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::enqueue::284, condition: batchSize > 0 && batchSize <= mEngine.getMaxBatchSize(). Note: Batch size was: 4, but engine max batch size was: 1
)
[07/26/2022-08:43:29] [E] Error occurred during inference
&&&& FAILED TensorRT.trtexec [TensorRT v8204] # /workspace/tensorrt/bin/trtexec --loadEngine=dynamic_batch.plan --batch=4

root@9dd4dce9103b:/workspace/simswap2trt/2trt# /workspace/tensorrt/bin/trtexec --loadEngine=dynamic_batch.plan --shapes=x1:4x3x224x224,x2:4x512
&&&& RUNNING TensorRT.trtexec [TensorRT v8204] # /workspace/tensorrt/bin/trtexec --loadEngine=dynamic_batch.plan --shapes=x1:4x3x224x224,x2:4x512
[07/26/2022-08:44:01] [I] === Model Options ===
[07/26/2022-08:44:01] [I] Format: *
[07/26/2022-08:44:01] [I] Model: 
[07/26/2022-08:44:01] [I] Output:
[07/26/2022-08:44:01] [I] === Build Options ===
[07/26/2022-08:44:01] [I] Max batch: explicit batch
[07/26/2022-08:44:01] [I] Workspace: 16 MiB
[07/26/2022-08:44:01] [I] minTiming: 1
[07/26/2022-08:44:01] [I] avgTiming: 8
[07/26/2022-08:44:01] [I] Precision: FP32
[07/26/2022-08:44:01] [I] Calibration: 
[07/26/2022-08:44:01] [I] Refit: Disabled
[07/26/2022-08:44:01] [I] Sparsity: Disabled
[07/26/2022-08:44:01] [I] Safe mode: Disabled
[07/26/2022-08:44:01] [I] DirectIO mode: Disabled
[07/26/2022-08:44:01] [I] Restricted mode: Disabled
[07/26/2022-08:44:01] [I] Save engine: 
[07/26/2022-08:44:01] [I] Load engine: dynamic_batch.plan
[07/26/2022-08:44:01] [I] Profiling verbosity: 0
[07/26/2022-08:44:01] [I] Tactic sources: Using default tactic sources
[07/26/2022-08:44:01] [I] timingCacheMode: local
[07/26/2022-08:44:01] [I] timingCacheFile: 
[07/26/2022-08:44:01] [I] Input(s)s format: fp32:CHW
[07/26/2022-08:44:01] [I] Output(s)s format: fp32:CHW
[07/26/2022-08:44:01] [I] Input build shape: x1=4x3x224x224+4x3x224x224+4x3x224x224
[07/26/2022-08:44:01] [I] Input build shape: x2=4x512+4x512+4x512
[07/26/2022-08:44:01] [I] Input calibration shapes: model
[07/26/2022-08:44:01] [I] === System Options ===
[07/26/2022-08:44:01] [I] Device: 0
[07/26/2022-08:44:01] [I] DLACore: 
[07/26/2022-08:44:01] [I] Plugins:
[07/26/2022-08:44:01] [I] === Inference Options ===
[07/26/2022-08:44:01] [I] Batch: Explicit
[07/26/2022-08:44:01] [I] Input inference shape: x2=4x512
[07/26/2022-08:44:01] [I] Input inference shape: x1=4x3x224x224
[07/26/2022-08:44:01] [I] Iterations: 10
[07/26/2022-08:44:01] [I] Duration: 3s (+ 200ms warm up)
[07/26/2022-08:44:01] [I] Sleep time: 0ms
[07/26/2022-08:44:01] [I] Idle time: 0ms
[07/26/2022-08:44:01] [I] Streams: 1
[07/26/2022-08:44:01] [I] ExposeDMA: Disabled
[07/26/2022-08:44:01] [I] Data transfers: Enabled
[07/26/2022-08:44:01] [I] Spin-wait: Disabled
[07/26/2022-08:44:01] [I] Multithreading: Disabled
[07/26/2022-08:44:01] [I] CUDA Graph: Disabled
[07/26/2022-08:44:01] [I] Separate profiling: Disabled
[07/26/2022-08:44:01] [I] Time Deserialize: Disabled
[07/26/2022-08:44:01] [I] Time Refit: Disabled
[07/26/2022-08:44:01] [I] Skip inference: Disabled
[07/26/2022-08:44:01] [I] Inputs:
[07/26/2022-08:44:01] [I] === Reporting Options ===
[07/26/2022-08:44:01] [I] Verbose: Disabled
[07/26/2022-08:44:01] [I] Averages: 10 inferences
[07/26/2022-08:44:01] [I] Percentile: 99
[07/26/2022-08:44:01] [I] Dump refittable layers:Disabled
[07/26/2022-08:44:01] [I] Dump output: Disabled
[07/26/2022-08:44:01] [I] Profile: Disabled
[07/26/2022-08:44:01] [I] Export timing to JSON file: 
[07/26/2022-08:44:01] [I] Export output to JSON file: 
[07/26/2022-08:44:01] [I] Export profile to JSON file: 
[07/26/2022-08:44:01] [I] 
[07/26/2022-08:44:01] [I] === Device Information ===
[07/26/2022-08:44:01] [I] Selected Device: NVIDIA GeForce RTX 2080 Ti
[07/26/2022-08:44:01] [I] Compute Capability: 7.5
[07/26/2022-08:44:01] [I] SMs: 68
[07/26/2022-08:44:01] [I] Compute Clock Rate: 1.65 GHz
[07/26/2022-08:44:01] [I] Device Global Memory: 11011 MiB
[07/26/2022-08:44:01] [I] Shared Memory per SM: 64 KiB
[07/26/2022-08:44:01] [I] Memory Bus Width: 352 bits (ECC disabled)
[07/26/2022-08:44:01] [I] Memory Clock Rate: 7 GHz
[07/26/2022-08:44:01] [I] 
[07/26/2022-08:44:01] [I] TensorRT version: 8.2.4
[07/26/2022-08:44:01] [I] [TRT] [MemUsageChange] Init CUDA: CPU +321, GPU +0, now: CPU 879, GPU 870 (MiB)
[07/26/2022-08:44:01] [I] [TRT] Loaded engine size: 545 MiB
[07/26/2022-08:44:02] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +513, GPU +222, now: CPU 1431, GPU 1604 (MiB)
[07/26/2022-08:44:02] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +116, GPU +54, now: CPU 1547, GPU 1658 (MiB)
[07/26/2022-08:44:02] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +508, now: CPU 0, GPU 508 (MiB)
[07/26/2022-08:44:02] [I] Engine loaded in 0.983595 sec.
[07/26/2022-08:44:02] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 1001, GPU 1650 (MiB)
[07/26/2022-08:44:02] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1001, GPU 1658 (MiB)
[07/26/2022-08:44:02] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +627, now: CPU 0, GPU 1135 (MiB)
[07/26/2022-08:44:02] [I] Using random values for input x1
[07/26/2022-08:44:02] [I] Created input binding for x1 with dimensions 4x3x224x224
[07/26/2022-08:44:02] [I] Using random values for input x2
[07/26/2022-08:44:02] [I] Created input binding for x2 with dimensions 4x512
[07/26/2022-08:44:02] [I] Using random values for output outputs
[07/26/2022-08:44:02] [I] Created output binding for outputs with dimensions 4x3x224x224
[07/26/2022-08:44:02] [I] Starting inference
[07/26/2022-08:44:05] [I] Warmup completed 1 queries over 200 ms
[07/26/2022-08:44:05] [I] Timing trace has 93 queries over 2.88673 s
[07/26/2022-08:44:05] [I] 
[07/26/2022-08:44:05] [I] === Trace details ===
[07/26/2022-08:44:05] [I] Trace averages of 10 runs:
[07/26/2022-08:44:05] [I] Average on 10 runs - GPU latency: 34.3636 ms - Host latency: 34.9215 ms (end to end 65.277 ms, enqueue 2.94725 ms)
[07/26/2022-08:44:05] [I] Average on 10 runs - GPU latency: 30.1534 ms - Host latency: 30.6778 ms (end to end 59.8989 ms, enqueue 3.40982 ms)
[07/26/2022-08:44:05] [I] Average on 10 runs - GPU latency: 30.2245 ms - Host latency: 30.7502 ms (end to end 60.2467 ms, enqueue 3.50318 ms)
[07/26/2022-08:44:05] [I] Average on 10 runs - GPU latency: 30.2769 ms - Host latency: 30.8062 ms (end to end 59.5009 ms, enqueue 3.50552 ms)
[07/26/2022-08:44:05] [I] Average on 10 runs - GPU latency: 30.4659 ms - Host latency: 30.9747 ms (end to end 60.7813 ms, enqueue 2.64545 ms)
[07/26/2022-08:44:05] [I] Average on 10 runs - GPU latency: 30.508 ms - Host latency: 31.0339 ms (end to end 60.8048 ms, enqueue 3.31053 ms)
[07/26/2022-08:44:05] [I] Average on 10 runs - GPU latency: 32.119 ms - Host latency: 32.6373 ms (end to end 64.0943 ms, enqueue 1.74031 ms)
[07/26/2022-08:44:05] [I] Average on 10 runs - GPU latency: 30.4284 ms - Host latency: 30.949 ms (end to end 60.6614 ms, enqueue 2.69846 ms)
[07/26/2022-08:44:05] [I] Average on 10 runs - GPU latency: 30.6439 ms - Host latency: 31.1798 ms (end to end 61.0422 ms, enqueue 3.30291 ms)
[07/26/2022-08:44:05] [I] 
[07/26/2022-08:44:05] [I] === Performance summary ===
[07/26/2022-08:44:05] [I] Throughput: 32.2164 qps
[07/26/2022-08:44:05] [I] Latency: min = 30.0452 ms, max = 38.9216 ms, mean = 31.5264 ms, median = 30.9492 ms, percentile(99%) = 38.9216 ms
[07/26/2022-08:44:05] [I] End-to-End Host Latency: min = 42.0844 ms, max = 76.3016 ms, mean = 61.3436 ms, median = 60.593 ms, percentile(99%) = 76.3016 ms
[07/26/2022-08:44:05] [I] Enqueue Time: min = 0.68335 ms, max = 4.70483 ms, mean = 3.02451 ms, median = 3.21851 ms, percentile(99%) = 4.70483 ms
[07/26/2022-08:44:05] [I] H2D Latency: min = 0.242432 ms, max = 0.47583 ms, mean = 0.313639 ms, median = 0.308838 ms, percentile(99%) = 0.47583 ms
[07/26/2022-08:44:05] [I] GPU Compute Time: min = 29.5245 ms, max = 38.4528 ms, mean = 30.999 ms, median = 30.438 ms, percentile(99%) = 38.4528 ms
[07/26/2022-08:44:05] [I] D2H Latency: min = 0.186035 ms, max = 0.218506 ms, mean = 0.213687 ms, median = 0.214111 ms, percentile(99%) = 0.218506 ms
[07/26/2022-08:44:05] [I] Total Host Walltime: 2.88673 s
[07/26/2022-08:44:05] [I] Total GPU Compute Time: 2.88291 s
[07/26/2022-08:44:05] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/26/2022-08:44:05] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8204] # /workspace/tensorrt/bin/trtexec --loadEngine=dynamic_batch.plan --shapes=x1:4x3x224x224,x2:4x512
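My reading of the two runs: --batch drives the legacy implicit-batch API, which an engine built with --explicitBatch rejects (its reported implicit max batch size is 1, matching the error from execute_async), while --shapes supplies the batch through the binding dimensions, which is why that run passes. A quick way to inspect this from Python (a hypothetical helper script; shape_arg just builds --shapes fragments, and the engine path follows the commands above):

```python
def shape_arg(name, shape):
    # Build one trtexec --shapes fragment,
    # e.g. ("x1", (4, 3, 224, 224)) -> "x1:4x3x224x224".
    return name + ":" + "x".join(str(d) for d in shape)

def describe_engine(engine_path):
    """Print each binding plus the optimization profile of a serialized engine."""
    import tensorrt as trt  # imported lazily; needs a TensorRT install

    logger = trt.Logger(trt.Logger.WARNING)
    with open(engine_path, "rb") as f, trt.Runtime(logger) as runtime:
        engine = runtime.deserialize_cuda_engine(f.read())

    # For an engine built with --explicitBatch this reports 1: the implicit
    # limit is meaningless, which is exactly why --batch=4 is rejected.
    print("implicit max batch size:", engine.max_batch_size)
    for i in range(engine.num_bindings):
        kind = "input " if engine.binding_is_input(i) else "output"
        print(kind, engine.get_binding_name(i), tuple(engine.get_binding_shape(i)))
        if engine.binding_is_input(i):
            # (min, opt, max) shapes of profile 0; built with min=opt=max=4
            # here, so only batch 4 is accepted at runtime.
            print("  profile 0 (min/opt/max):", engine.get_profile_shape(0, i))
```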

Hi,

We recommend that you use the latest TensorRT version, 8.4 GA.
We couldn’t reproduce the issue using the latest TensorRT version.

&&&& PASSED TensorRT.trtexec [TensorRT v8401] # /opt/tensorrt/bin/trtexec --loadEngine=dynamic_batch.plan --batch=4

Thank you.