TensorRT inference with batch > 1

Description

I want to run TensorRT inference with batching. Please look at simswapRuntrt2.py below.
In inference_engine(), calling trt_context.execute_async(batch_size=4, bindings=bindings, stream_handle=stream.handle) makes every result 0 and produces this error:

[07/26/2022-08:09:43] [TRT] [E] 3: [executionContext.cpp::enqueue::284] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::enqueue::284, condition: batchSize > 0 && batchSize <= mEngine.getMaxBatchSize(). Note: Batch size was: 4, but engine max batch size was: 1
)
Every result should be 78592.41, but instead I get:
0.0
0.0
0.0
0.0

Also, trt_context.execute_async_v2(bindings=bindings, stream_handle=stream.handle) runs without error but produces quite different results.

Again, every result should be 78592.41, but I get:
108099.29
76659.984
67055.984
56882.566

I suspect the cause is in get_engine(), or that I made a mistake when converting the model to TensorRT.
Please see my simswapRuntrt2.py code together with the model.

How I converted the model:

/workspace/tensorrt/bin/trtexec --explicitBatch --onnx=/workspace/simswap2trt/2trt/dynamic_folded.onnx --minShapes=x1:4x3x224x224,x2:4x512 --optShapes=x1:4x3x224x224,x2:4x512 --maxShapes=x1:4x3x224x224,x2:4x512 --saveEngine=/workspace/simswap2trt/2trt/dynamic_batch.plan
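For reference, this is the explicit-batch inference path as I understand it must look: with an --explicitBatch engine, the batch size travels inside the binding shapes, so the shapes have to be set on the context before buffers are sized, and inference has to go through execute_async_v2(). A minimal sketch (the buffer handling and pycuda usage here are my assumptions, not the exact code of simswapRuntrt2.py):

```python
import numpy as np

def volume(shape):
    # Total element count of a fully specified binding shape.
    return int(np.prod(shape))

def infer_batch(engine_path, x1, x2):
    """Run one explicit-batch inference: x1 is (N, 3, 224, 224), x2 is (N, 512)."""
    # Imported here so volume() above stays usable without a GPU stack.
    import tensorrt as trt
    import pycuda.autoinit  # noqa: F401 -- creates the CUDA context
    import pycuda.driver as cuda

    logger = trt.Logger(trt.Logger.WARNING)
    with open(engine_path, "rb") as f, trt.Runtime(logger) as runtime:
        engine = runtime.deserialize_cuda_engine(f.read())
    context = engine.create_execution_context()

    # Explicit batch: set the concrete input shapes (batch included)
    # on the context before sizing any buffers.
    context.set_binding_shape(engine.get_binding_index("x1"), x1.shape)
    context.set_binding_shape(engine.get_binding_index("x2"), x2.shape)

    stream = cuda.Stream()
    inputs = {"x1": x1, "x2": x2}
    bindings, outputs = [], []
    for i in range(engine.num_bindings):
        shape = tuple(context.get_binding_shape(i))  # resolved, batch included
        host = np.empty(volume(shape), dtype=np.float32)
        dev = cuda.mem_alloc(host.nbytes)
        bindings.append(int(dev))
        if engine.binding_is_input(i):
            src = np.ascontiguousarray(inputs[engine.get_binding_name(i)],
                                       dtype=np.float32)
            cuda.memcpy_htod_async(dev, src, stream)
        else:
            outputs.append((host, dev, shape))

    # v2 API: no batch_size argument; the binding shapes carry the batch.
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    for host, dev, _ in outputs:
        cuda.memcpy_dtoh_async(host, dev, stream)
    stream.synchronize()
    return [host.reshape(shape) for host, _, shape in outputs]
```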

Thank you ;)

Environment

TensorRT Version: 8.2.4
GPU Type: RTX 2080 Ti
Nvidia Driver Version: 470.57.02
CUDA Version: 11.4
CUDNN Version: 8.4.0
Operating System + Version: Ubuntu 20.04
Python Version (if applicable): Python 3.8.10
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.12.0+cu116
Baremetal or Container (if container which image + tag): nvcr.io/nvidia/tensorrt:22.04-py3

Relevant Files


simswapRuntrt2.py (4.3 KB)

latend_input.npy - Google Drive, swap result0.npy - Google Drive, b_align_crop_tensor_input0.npy - Google Drive

Steps To Reproduce


Hi,

We request that you share the model, script, profiler, and performance output (if not already shared) so that we can help you better.

Alternatively, you can try running your model with the trtexec command.

While measuring model performance, make sure you consider the latency and throughput of the network inference, excluding the data pre- and post-processing overhead.
Please refer to the links below for more details:
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-803/best-practices/index.html#measure-performance

https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-803/best-practices/index.html#model-accuracy

Thanks!

All the other files are uploaded!
Here are the trtexec results:

  1. trtexec --loadEngine=dynamic_batch.plan --batch=4 => failed
  2. trtexec --loadEngine=dynamic_batch.plan --shapes=x1:4x3x224x224,x2:4x512 => passed
root@9dd4dce9103b:/workspace/simswap2trt/2trt# /workspace/tensorrt/bin/trtexec --loadEngine=dynamic_batch.plan --batch=4
&&&& RUNNING TensorRT.trtexec [TensorRT v8204] # /workspace/tensorrt/bin/trtexec --loadEngine=dynamic_batch.plan --batch=4
[07/26/2022-08:43:28] [I] === Model Options ===
[07/26/2022-08:43:28] [I] Format: *
[07/26/2022-08:43:28] [I] Model: 
[07/26/2022-08:43:28] [I] Output:
[07/26/2022-08:43:28] [I] === Build Options ===
[07/26/2022-08:43:28] [I] Max batch: 4
[07/26/2022-08:43:28] [I] Workspace: 16 MiB
[07/26/2022-08:43:28] [I] minTiming: 1
[07/26/2022-08:43:28] [I] avgTiming: 8
[07/26/2022-08:43:28] [I] Precision: FP32
[07/26/2022-08:43:28] [I] Calibration: 
[07/26/2022-08:43:28] [I] Refit: Disabled
[07/26/2022-08:43:28] [I] Sparsity: Disabled
[07/26/2022-08:43:28] [I] Safe mode: Disabled
[07/26/2022-08:43:28] [I] DirectIO mode: Disabled
[07/26/2022-08:43:28] [I] Restricted mode: Disabled
[07/26/2022-08:43:28] [I] Save engine: 
[07/26/2022-08:43:28] [I] Load engine: dynamic_batch.plan
[07/26/2022-08:43:28] [I] Profiling verbosity: 0
[07/26/2022-08:43:28] [I] Tactic sources: Using default tactic sources
[07/26/2022-08:43:28] [I] timingCacheMode: local
[07/26/2022-08:43:28] [I] timingCacheFile: 
[07/26/2022-08:43:28] [I] Input(s)s format: fp32:CHW
[07/26/2022-08:43:28] [I] Output(s)s format: fp32:CHW
[07/26/2022-08:43:28] [I] Input build shapes: model
[07/26/2022-08:43:28] [I] Input calibration shapes: model
[07/26/2022-08:43:28] [I] === System Options ===
[07/26/2022-08:43:28] [I] Device: 0
[07/26/2022-08:43:28] [I] DLACore: 
[07/26/2022-08:43:28] [I] Plugins:
[07/26/2022-08:43:28] [I] === Inference Options ===
[07/26/2022-08:43:28] [I] Batch: 4
[07/26/2022-08:43:28] [I] Input inference shapes: model
[07/26/2022-08:43:28] [I] Iterations: 10
[07/26/2022-08:43:28] [I] Duration: 3s (+ 200ms warm up)
[07/26/2022-08:43:28] [I] Sleep time: 0ms
[07/26/2022-08:43:28] [I] Idle time: 0ms
[07/26/2022-08:43:28] [I] Streams: 1
[07/26/2022-08:43:28] [I] ExposeDMA: Disabled
[07/26/2022-08:43:28] [I] Data transfers: Enabled
[07/26/2022-08:43:28] [I] Spin-wait: Disabled
[07/26/2022-08:43:28] [I] Multithreading: Disabled
[07/26/2022-08:43:28] [I] CUDA Graph: Disabled
[07/26/2022-08:43:28] [I] Separate profiling: Disabled
[07/26/2022-08:43:28] [I] Time Deserialize: Disabled
[07/26/2022-08:43:28] [I] Time Refit: Disabled
[07/26/2022-08:43:28] [I] Skip inference: Disabled
[07/26/2022-08:43:28] [I] Inputs:
[07/26/2022-08:43:28] [I] === Reporting Options ===
[07/26/2022-08:43:28] [I] Verbose: Disabled
[07/26/2022-08:43:28] [I] Averages: 10 inferences
[07/26/2022-08:43:28] [I] Percentile: 99
[07/26/2022-08:43:28] [I] Dump refittable layers:Disabled
[07/26/2022-08:43:28] [I] Dump output: Disabled
[07/26/2022-08:43:28] [I] Profile: Disabled
[07/26/2022-08:43:28] [I] Export timing to JSON file: 
[07/26/2022-08:43:28] [I] Export output to JSON file: 
[07/26/2022-08:43:28] [I] Export profile to JSON file: 
[07/26/2022-08:43:28] [I] 
[07/26/2022-08:43:28] [I] === Device Information ===
[07/26/2022-08:43:28] [I] Selected Device: NVIDIA GeForce RTX 2080 Ti
[07/26/2022-08:43:28] [I] Compute Capability: 7.5
[07/26/2022-08:43:28] [I] SMs: 68
[07/26/2022-08:43:28] [I] Compute Clock Rate: 1.65 GHz
[07/26/2022-08:43:28] [I] Device Global Memory: 11011 MiB
[07/26/2022-08:43:28] [I] Shared Memory per SM: 64 KiB
[07/26/2022-08:43:28] [I] Memory Bus Width: 352 bits (ECC disabled)
[07/26/2022-08:43:28] [I] Memory Clock Rate: 7 GHz
[07/26/2022-08:43:28] [I] 
[07/26/2022-08:43:28] [I] TensorRT version: 8.2.4
[07/26/2022-08:43:28] [I] [TRT] [MemUsageChange] Init CUDA: CPU +321, GPU +0, now: CPU 879, GPU 870 (MiB)
[07/26/2022-08:43:28] [I] [TRT] Loaded engine size: 545 MiB
[07/26/2022-08:43:29] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +513, GPU +222, now: CPU 1431, GPU 1604 (MiB)
[07/26/2022-08:43:29] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +116, GPU +54, now: CPU 1547, GPU 1658 (MiB)
[07/26/2022-08:43:29] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +508, now: CPU 0, GPU 508 (MiB)
[07/26/2022-08:43:29] [I] Engine loaded in 0.994565 sec.
[07/26/2022-08:43:29] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 1001, GPU 1650 (MiB)
[07/26/2022-08:43:29] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1001, GPU 1658 (MiB)
[07/26/2022-08:43:29] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +627, now: CPU 0, GPU 1135 (MiB)
[07/26/2022-08:43:29] [I] Using random values for input x1
[07/26/2022-08:43:29] [I] Created input binding for x1 with dimensions 4x3x224x224
[07/26/2022-08:43:29] [I] Using random values for input x2
[07/26/2022-08:43:29] [I] Created input binding for x2 with dimensions 4x512
[07/26/2022-08:43:29] [I] Using random values for output outputs
[07/26/2022-08:43:29] [I] Created output binding for outputs with dimensions 4x3x224x224
[07/26/2022-08:43:29] [I] Starting inference
[07/26/2022-08:43:29] [E] Error[3]: [executionContext.cpp::enqueue::284] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::enqueue::284, condition: batchSize > 0 && batchSize <= mEngine.getMaxBatchSize(). Note: Batch size was: 4, but engine max batch size was: 1
)
[07/26/2022-08:43:29] [E] Error occurred during inference
&&&& FAILED TensorRT.trtexec [TensorRT v8204] # /workspace/tensorrt/bin/trtexec --loadEngine=dynamic_batch.plan --batch=4

root@9dd4dce9103b:/workspace/simswap2trt/2trt# /workspace/tensorrt/bin/trtexec --loadEngine=dynamic_batch.plan --shapes=x1:4x3x224x224,x2:4x512
&&&& RUNNING TensorRT.trtexec [TensorRT v8204] # /workspace/tensorrt/bin/trtexec --loadEngine=dynamic_batch.plan --shapes=x1:4x3x224x224,x2:4x512
[07/26/2022-08:44:01] [I] === Model Options ===
[07/26/2022-08:44:01] [I] Format: *
[07/26/2022-08:44:01] [I] Model: 
[07/26/2022-08:44:01] [I] Output:
[07/26/2022-08:44:01] [I] === Build Options ===
[07/26/2022-08:44:01] [I] Max batch: explicit batch
[07/26/2022-08:44:01] [I] Workspace: 16 MiB
[07/26/2022-08:44:01] [I] minTiming: 1
[07/26/2022-08:44:01] [I] avgTiming: 8
[07/26/2022-08:44:01] [I] Precision: FP32
[07/26/2022-08:44:01] [I] Calibration: 
[07/26/2022-08:44:01] [I] Refit: Disabled
[07/26/2022-08:44:01] [I] Sparsity: Disabled
[07/26/2022-08:44:01] [I] Safe mode: Disabled
[07/26/2022-08:44:01] [I] DirectIO mode: Disabled
[07/26/2022-08:44:01] [I] Restricted mode: Disabled
[07/26/2022-08:44:01] [I] Save engine: 
[07/26/2022-08:44:01] [I] Load engine: dynamic_batch.plan
[07/26/2022-08:44:01] [I] Profiling verbosity: 0
[07/26/2022-08:44:01] [I] Tactic sources: Using default tactic sources
[07/26/2022-08:44:01] [I] timingCacheMode: local
[07/26/2022-08:44:01] [I] timingCacheFile: 
[07/26/2022-08:44:01] [I] Input(s)s format: fp32:CHW
[07/26/2022-08:44:01] [I] Output(s)s format: fp32:CHW
[07/26/2022-08:44:01] [I] Input build shape: x1=4x3x224x224+4x3x224x224+4x3x224x224
[07/26/2022-08:44:01] [I] Input build shape: x2=4x512+4x512+4x512
[07/26/2022-08:44:01] [I] Input calibration shapes: model
[07/26/2022-08:44:01] [I] === System Options ===
[07/26/2022-08:44:01] [I] Device: 0
[07/26/2022-08:44:01] [I] DLACore: 
[07/26/2022-08:44:01] [I] Plugins:
[07/26/2022-08:44:01] [I] === Inference Options ===
[07/26/2022-08:44:01] [I] Batch: Explicit
[07/26/2022-08:44:01] [I] Input inference shape: x2=4x512
[07/26/2022-08:44:01] [I] Input inference shape: x1=4x3x224x224
[07/26/2022-08:44:01] [I] Iterations: 10
[07/26/2022-08:44:01] [I] Duration: 3s (+ 200ms warm up)
[07/26/2022-08:44:01] [I] Sleep time: 0ms
[07/26/2022-08:44:01] [I] Idle time: 0ms
[07/26/2022-08:44:01] [I] Streams: 1
[07/26/2022-08:44:01] [I] ExposeDMA: Disabled
[07/26/2022-08:44:01] [I] Data transfers: Enabled
[07/26/2022-08:44:01] [I] Spin-wait: Disabled
[07/26/2022-08:44:01] [I] Multithreading: Disabled
[07/26/2022-08:44:01] [I] CUDA Graph: Disabled
[07/26/2022-08:44:01] [I] Separate profiling: Disabled
[07/26/2022-08:44:01] [I] Time Deserialize: Disabled
[07/26/2022-08:44:01] [I] Time Refit: Disabled
[07/26/2022-08:44:01] [I] Skip inference: Disabled
[07/26/2022-08:44:01] [I] Inputs:
[07/26/2022-08:44:01] [I] === Reporting Options ===
[07/26/2022-08:44:01] [I] Verbose: Disabled
[07/26/2022-08:44:01] [I] Averages: 10 inferences
[07/26/2022-08:44:01] [I] Percentile: 99
[07/26/2022-08:44:01] [I] Dump refittable layers:Disabled
[07/26/2022-08:44:01] [I] Dump output: Disabled
[07/26/2022-08:44:01] [I] Profile: Disabled
[07/26/2022-08:44:01] [I] Export timing to JSON file: 
[07/26/2022-08:44:01] [I] Export output to JSON file: 
[07/26/2022-08:44:01] [I] Export profile to JSON file: 
[07/26/2022-08:44:01] [I] 
[07/26/2022-08:44:01] [I] === Device Information ===
[07/26/2022-08:44:01] [I] Selected Device: NVIDIA GeForce RTX 2080 Ti
[07/26/2022-08:44:01] [I] Compute Capability: 7.5
[07/26/2022-08:44:01] [I] SMs: 68
[07/26/2022-08:44:01] [I] Compute Clock Rate: 1.65 GHz
[07/26/2022-08:44:01] [I] Device Global Memory: 11011 MiB
[07/26/2022-08:44:01] [I] Shared Memory per SM: 64 KiB
[07/26/2022-08:44:01] [I] Memory Bus Width: 352 bits (ECC disabled)
[07/26/2022-08:44:01] [I] Memory Clock Rate: 7 GHz
[07/26/2022-08:44:01] [I] 
[07/26/2022-08:44:01] [I] TensorRT version: 8.2.4
[07/26/2022-08:44:01] [I] [TRT] [MemUsageChange] Init CUDA: CPU +321, GPU +0, now: CPU 879, GPU 870 (MiB)
[07/26/2022-08:44:01] [I] [TRT] Loaded engine size: 545 MiB
[07/26/2022-08:44:02] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +513, GPU +222, now: CPU 1431, GPU 1604 (MiB)
[07/26/2022-08:44:02] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +116, GPU +54, now: CPU 1547, GPU 1658 (MiB)
[07/26/2022-08:44:02] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +508, now: CPU 0, GPU 508 (MiB)
[07/26/2022-08:44:02] [I] Engine loaded in 0.983595 sec.
[07/26/2022-08:44:02] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 1001, GPU 1650 (MiB)
[07/26/2022-08:44:02] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1001, GPU 1658 (MiB)
[07/26/2022-08:44:02] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +627, now: CPU 0, GPU 1135 (MiB)
[07/26/2022-08:44:02] [I] Using random values for input x1
[07/26/2022-08:44:02] [I] Created input binding for x1 with dimensions 4x3x224x224
[07/26/2022-08:44:02] [I] Using random values for input x2
[07/26/2022-08:44:02] [I] Created input binding for x2 with dimensions 4x512
[07/26/2022-08:44:02] [I] Using random values for output outputs
[07/26/2022-08:44:02] [I] Created output binding for outputs with dimensions 4x3x224x224
[07/26/2022-08:44:02] [I] Starting inference
[07/26/2022-08:44:05] [I] Warmup completed 1 queries over 200 ms
[07/26/2022-08:44:05] [I] Timing trace has 93 queries over 2.88673 s
[07/26/2022-08:44:05] [I] 
[07/26/2022-08:44:05] [I] === Trace details ===
[07/26/2022-08:44:05] [I] Trace averages of 10 runs:
[07/26/2022-08:44:05] [I] Average on 10 runs - GPU latency: 34.3636 ms - Host latency: 34.9215 ms (end to end 65.277 ms, enqueue 2.94725 ms)
[07/26/2022-08:44:05] [I] Average on 10 runs - GPU latency: 30.1534 ms - Host latency: 30.6778 ms (end to end 59.8989 ms, enqueue 3.40982 ms)
[07/26/2022-08:44:05] [I] Average on 10 runs - GPU latency: 30.2245 ms - Host latency: 30.7502 ms (end to end 60.2467 ms, enqueue 3.50318 ms)
[07/26/2022-08:44:05] [I] Average on 10 runs - GPU latency: 30.2769 ms - Host latency: 30.8062 ms (end to end 59.5009 ms, enqueue 3.50552 ms)
[07/26/2022-08:44:05] [I] Average on 10 runs - GPU latency: 30.4659 ms - Host latency: 30.9747 ms (end to end 60.7813 ms, enqueue 2.64545 ms)
[07/26/2022-08:44:05] [I] Average on 10 runs - GPU latency: 30.508 ms - Host latency: 31.0339 ms (end to end 60.8048 ms, enqueue 3.31053 ms)
[07/26/2022-08:44:05] [I] Average on 10 runs - GPU latency: 32.119 ms - Host latency: 32.6373 ms (end to end 64.0943 ms, enqueue 1.74031 ms)
[07/26/2022-08:44:05] [I] Average on 10 runs - GPU latency: 30.4284 ms - Host latency: 30.949 ms (end to end 60.6614 ms, enqueue 2.69846 ms)
[07/26/2022-08:44:05] [I] Average on 10 runs - GPU latency: 30.6439 ms - Host latency: 31.1798 ms (end to end 61.0422 ms, enqueue 3.30291 ms)
[07/26/2022-08:44:05] [I] 
[07/26/2022-08:44:05] [I] === Performance summary ===
[07/26/2022-08:44:05] [I] Throughput: 32.2164 qps
[07/26/2022-08:44:05] [I] Latency: min = 30.0452 ms, max = 38.9216 ms, mean = 31.5264 ms, median = 30.9492 ms, percentile(99%) = 38.9216 ms
[07/26/2022-08:44:05] [I] End-to-End Host Latency: min = 42.0844 ms, max = 76.3016 ms, mean = 61.3436 ms, median = 60.593 ms, percentile(99%) = 76.3016 ms
[07/26/2022-08:44:05] [I] Enqueue Time: min = 0.68335 ms, max = 4.70483 ms, mean = 3.02451 ms, median = 3.21851 ms, percentile(99%) = 4.70483 ms
[07/26/2022-08:44:05] [I] H2D Latency: min = 0.242432 ms, max = 0.47583 ms, mean = 0.313639 ms, median = 0.308838 ms, percentile(99%) = 0.47583 ms
[07/26/2022-08:44:05] [I] GPU Compute Time: min = 29.5245 ms, max = 38.4528 ms, mean = 30.999 ms, median = 30.438 ms, percentile(99%) = 38.4528 ms
[07/26/2022-08:44:05] [I] D2H Latency: min = 0.186035 ms, max = 0.218506 ms, mean = 0.213687 ms, median = 0.214111 ms, percentile(99%) = 0.218506 ms
[07/26/2022-08:44:05] [I] Total Host Walltime: 2.88673 s
[07/26/2022-08:44:05] [I] Total GPU Compute Time: 2.88291 s
[07/26/2022-08:44:05] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/26/2022-08:44:05] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8204] # /workspace/tensorrt/bin/trtexec --loadEngine=dynamic_batch.plan --shapes=x1:4x3x224x224,x2:4x512
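My reading of the two runs: --batch drives the legacy implicit-batch API, which an engine built with --explicitBatch rejects (its reported implicit max batch size is 1, matching the error from execute_async), while --shapes supplies the batch through the binding dimensions, which is why that run passes. A quick way to inspect this from Python (a hypothetical helper script; shape_arg just builds --shapes fragments, and the engine path follows the commands above):

```python
def shape_arg(name, shape):
    # Build one trtexec --shapes fragment,
    # e.g. ("x1", (4, 3, 224, 224)) -> "x1:4x3x224x224".
    return name + ":" + "x".join(str(d) for d in shape)

def describe_engine(engine_path):
    """Print each binding plus the optimization profile of a serialized engine."""
    import tensorrt as trt  # imported lazily; needs a TensorRT install

    logger = trt.Logger(trt.Logger.WARNING)
    with open(engine_path, "rb") as f, trt.Runtime(logger) as runtime:
        engine = runtime.deserialize_cuda_engine(f.read())

    # For an engine built with --explicitBatch this reports 1: the implicit
    # limit is meaningless, which is exactly why --batch=4 is rejected.
    print("implicit max batch size:", engine.max_batch_size)
    for i in range(engine.num_bindings):
        kind = "input " if engine.binding_is_input(i) else "output"
        print(kind, engine.get_binding_name(i), tuple(engine.get_binding_shape(i)))
        if engine.binding_is_input(i):
            # (min, opt, max) shapes of profile 0; built with min=opt=max=4
            # here, so only batch 4 is accepted at runtime.
            print("  profile 0 (min/opt/max):", engine.get_profile_shape(0, i))
```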

Hi,

We recommend that you use the latest TensorRT version, 8.4 GA.
We couldn’t reproduce the issue using the latest TensorRT version.

&&&& PASSED TensorRT.trtexec [TensorRT v8401] # /opt/tensorrt/bin/trtexec --loadEngine=dynamic_batch.plan --batch=4

Thank you.