All other files are uploaded!
Here are the trtexec results:
- trtexec --loadEngine=dynamic_batch.plan --batch=4 => fails
- trtexec --loadEngine=dynamic_batch.plan --shapes=x1:4x3x224x224,x2:4x512 => passes
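The `--batch=4` run fails because dynamic_batch.plan is an explicit-batch engine (the log reports "Max batch: explicit batch" and "engine max batch size was: 1"), so the batch size must be supplied through the runtime binding shapes rather than the legacy implicit-batch parameter. Below is a minimal Python sketch of driving this engine at batch 4, assuming the TensorRT 8.2 Python bindings plus pycuda for buffer management; the plan file name and the binding names x1/x2/outputs come from the log, everything else (random inputs, buffer handling) is illustrative only and not the project's actual inference code.

```python
# Minimal sketch (assumption, not the project's actual inference code):
# run an explicit-batch TensorRT 8.2 engine at batch size 4 from Python.
import numpy as np
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("dynamic_batch.plan", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# With an explicit-batch engine the batch size is part of the binding shape,
# which is why trtexec needs --shapes here and rejects --batch=4.
context.set_binding_shape(engine.get_binding_index("x1"), (4, 3, 224, 224))
context.set_binding_shape(engine.get_binding_index("x2"), (4, 512))

# Allocate host/device buffers for every binding at the resolved shapes.
bindings, host_bufs, dev_bufs = [], [], []
for i in range(engine.num_bindings):
    shape = tuple(context.get_binding_shape(i))
    dtype = trt.nptype(engine.get_binding_dtype(i))
    host = np.random.random(shape).astype(dtype)  # random inputs, like trtexec
    dev = cuda.mem_alloc(host.nbytes)
    if engine.binding_is_input(i):
        cuda.memcpy_htod(dev, host)
    host_bufs.append(host)
    dev_bufs.append(dev)
    bindings.append(int(dev))

# execute_v2 is the explicit-batch entry point; the legacy execute(batch_size=...)
# path fails with the same "max batch size was: 1" error shown in the log below.
context.execute_v2(bindings)
out_idx = engine.get_binding_index("outputs")
cuda.memcpy_dtoh(host_bufs[out_idx], dev_bufs[out_idx])
print(host_bufs[out_idx].shape)  # expected (4, 3, 224, 224)
```

The second trtexec invocation passes for the same reason: `--shapes` provides the runtime binding shapes that `set_binding_shape` provides in the sketch above.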
root@9dd4dce9103b:/workspace/simswap2trt/2trt# /workspace/tensorrt/bin/trtexec --loadEngine=dynamic_batch.plan --batch=4
&&&& RUNNING TensorRT.trtexec [TensorRT v8204] # /workspace/tensorrt/bin/trtexec --loadEngine=dynamic_batch.plan --batch=4
[07/26/2022-08:43:28] [I] === Model Options ===
[07/26/2022-08:43:28] [I] Format: *
[07/26/2022-08:43:28] [I] Model:
[07/26/2022-08:43:28] [I] Output:
[07/26/2022-08:43:28] [I] === Build Options ===
[07/26/2022-08:43:28] [I] Max batch: 4
[07/26/2022-08:43:28] [I] Workspace: 16 MiB
[07/26/2022-08:43:28] [I] minTiming: 1
[07/26/2022-08:43:28] [I] avgTiming: 8
[07/26/2022-08:43:28] [I] Precision: FP32
[07/26/2022-08:43:28] [I] Calibration:
[07/26/2022-08:43:28] [I] Refit: Disabled
[07/26/2022-08:43:28] [I] Sparsity: Disabled
[07/26/2022-08:43:28] [I] Safe mode: Disabled
[07/26/2022-08:43:28] [I] DirectIO mode: Disabled
[07/26/2022-08:43:28] [I] Restricted mode: Disabled
[07/26/2022-08:43:28] [I] Save engine:
[07/26/2022-08:43:28] [I] Load engine: dynamic_batch.plan
[07/26/2022-08:43:28] [I] Profiling verbosity: 0
[07/26/2022-08:43:28] [I] Tactic sources: Using default tactic sources
[07/26/2022-08:43:28] [I] timingCacheMode: local
[07/26/2022-08:43:28] [I] timingCacheFile:
[07/26/2022-08:43:28] [I] Input(s)s format: fp32:CHW
[07/26/2022-08:43:28] [I] Output(s)s format: fp32:CHW
[07/26/2022-08:43:28] [I] Input build shapes: model
[07/26/2022-08:43:28] [I] Input calibration shapes: model
[07/26/2022-08:43:28] [I] === System Options ===
[07/26/2022-08:43:28] [I] Device: 0
[07/26/2022-08:43:28] [I] DLACore:
[07/26/2022-08:43:28] [I] Plugins:
[07/26/2022-08:43:28] [I] === Inference Options ===
[07/26/2022-08:43:28] [I] Batch: 4
[07/26/2022-08:43:28] [I] Input inference shapes: model
[07/26/2022-08:43:28] [I] Iterations: 10
[07/26/2022-08:43:28] [I] Duration: 3s (+ 200ms warm up)
[07/26/2022-08:43:28] [I] Sleep time: 0ms
[07/26/2022-08:43:28] [I] Idle time: 0ms
[07/26/2022-08:43:28] [I] Streams: 1
[07/26/2022-08:43:28] [I] ExposeDMA: Disabled
[07/26/2022-08:43:28] [I] Data transfers: Enabled
[07/26/2022-08:43:28] [I] Spin-wait: Disabled
[07/26/2022-08:43:28] [I] Multithreading: Disabled
[07/26/2022-08:43:28] [I] CUDA Graph: Disabled
[07/26/2022-08:43:28] [I] Separate profiling: Disabled
[07/26/2022-08:43:28] [I] Time Deserialize: Disabled
[07/26/2022-08:43:28] [I] Time Refit: Disabled
[07/26/2022-08:43:28] [I] Skip inference: Disabled
[07/26/2022-08:43:28] [I] Inputs:
[07/26/2022-08:43:28] [I] === Reporting Options ===
[07/26/2022-08:43:28] [I] Verbose: Disabled
[07/26/2022-08:43:28] [I] Averages: 10 inferences
[07/26/2022-08:43:28] [I] Percentile: 99
[07/26/2022-08:43:28] [I] Dump refittable layers:Disabled
[07/26/2022-08:43:28] [I] Dump output: Disabled
[07/26/2022-08:43:28] [I] Profile: Disabled
[07/26/2022-08:43:28] [I] Export timing to JSON file:
[07/26/2022-08:43:28] [I] Export output to JSON file:
[07/26/2022-08:43:28] [I] Export profile to JSON file:
[07/26/2022-08:43:28] [I]
[07/26/2022-08:43:28] [I] === Device Information ===
[07/26/2022-08:43:28] [I] Selected Device: NVIDIA GeForce RTX 2080 Ti
[07/26/2022-08:43:28] [I] Compute Capability: 7.5
[07/26/2022-08:43:28] [I] SMs: 68
[07/26/2022-08:43:28] [I] Compute Clock Rate: 1.65 GHz
[07/26/2022-08:43:28] [I] Device Global Memory: 11011 MiB
[07/26/2022-08:43:28] [I] Shared Memory per SM: 64 KiB
[07/26/2022-08:43:28] [I] Memory Bus Width: 352 bits (ECC disabled)
[07/26/2022-08:43:28] [I] Memory Clock Rate: 7 GHz
[07/26/2022-08:43:28] [I]
[07/26/2022-08:43:28] [I] TensorRT version: 8.2.4
[07/26/2022-08:43:28] [I] [TRT] [MemUsageChange] Init CUDA: CPU +321, GPU +0, now: CPU 879, GPU 870 (MiB)
[07/26/2022-08:43:28] [I] [TRT] Loaded engine size: 545 MiB
[07/26/2022-08:43:29] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +513, GPU +222, now: CPU 1431, GPU 1604 (MiB)
[07/26/2022-08:43:29] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +116, GPU +54, now: CPU 1547, GPU 1658 (MiB)
[07/26/2022-08:43:29] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +508, now: CPU 0, GPU 508 (MiB)
[07/26/2022-08:43:29] [I] Engine loaded in 0.994565 sec.
[07/26/2022-08:43:29] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 1001, GPU 1650 (MiB)
[07/26/2022-08:43:29] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1001, GPU 1658 (MiB)
[07/26/2022-08:43:29] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +627, now: CPU 0, GPU 1135 (MiB)
[07/26/2022-08:43:29] [I] Using random values for input x1
[07/26/2022-08:43:29] [I] Created input binding for x1 with dimensions 4x3x224x224
[07/26/2022-08:43:29] [I] Using random values for input x2
[07/26/2022-08:43:29] [I] Created input binding for x2 with dimensions 4x512
[07/26/2022-08:43:29] [I] Using random values for output outputs
[07/26/2022-08:43:29] [I] Created output binding for outputs with dimensions 4x3x224x224
[07/26/2022-08:43:29] [I] Starting inference
[07/26/2022-08:43:29] [E] Error[3]: [executionContext.cpp::enqueue::284] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::enqueue::284, condition: batchSize > 0 && batchSize <= mEngine.getMaxBatchSize(). Note: Batch size was: 4, but engine max batch size was: 1
)
[07/26/2022-08:43:29] [E] Error occurred during inference
&&&& FAILED TensorRT.trtexec [TensorRT v8204] # /workspace/tensorrt/bin/trtexec --loadEngine=dynamic_batch.plan --batch=4
root@9dd4dce9103b:/workspace/simswap2trt/2trt# /workspace/tensorrt/bin/trtexec --loadEngine=dynamic_batch.plan --shapes=x1:4x3x224x224,x2:4x512
&&&& RUNNING TensorRT.trtexec [TensorRT v8204] # /workspace/tensorrt/bin/trtexec --loadEngine=dynamic_batch.plan --shapes=x1:4x3x224x224,x2:4x512
[07/26/2022-08:44:01] [I] === Model Options ===
[07/26/2022-08:44:01] [I] Format: *
[07/26/2022-08:44:01] [I] Model:
[07/26/2022-08:44:01] [I] Output:
[07/26/2022-08:44:01] [I] === Build Options ===
[07/26/2022-08:44:01] [I] Max batch: explicit batch
[07/26/2022-08:44:01] [I] Workspace: 16 MiB
[07/26/2022-08:44:01] [I] minTiming: 1
[07/26/2022-08:44:01] [I] avgTiming: 8
[07/26/2022-08:44:01] [I] Precision: FP32
[07/26/2022-08:44:01] [I] Calibration:
[07/26/2022-08:44:01] [I] Refit: Disabled
[07/26/2022-08:44:01] [I] Sparsity: Disabled
[07/26/2022-08:44:01] [I] Safe mode: Disabled
[07/26/2022-08:44:01] [I] DirectIO mode: Disabled
[07/26/2022-08:44:01] [I] Restricted mode: Disabled
[07/26/2022-08:44:01] [I] Save engine:
[07/26/2022-08:44:01] [I] Load engine: dynamic_batch.plan
[07/26/2022-08:44:01] [I] Profiling verbosity: 0
[07/26/2022-08:44:01] [I] Tactic sources: Using default tactic sources
[07/26/2022-08:44:01] [I] timingCacheMode: local
[07/26/2022-08:44:01] [I] timingCacheFile:
[07/26/2022-08:44:01] [I] Input(s)s format: fp32:CHW
[07/26/2022-08:44:01] [I] Output(s)s format: fp32:CHW
[07/26/2022-08:44:01] [I] Input build shape: x1=4x3x224x224+4x3x224x224+4x3x224x224
[07/26/2022-08:44:01] [I] Input build shape: x2=4x512+4x512+4x512
[07/26/2022-08:44:01] [I] Input calibration shapes: model
[07/26/2022-08:44:01] [I] === System Options ===
[07/26/2022-08:44:01] [I] Device: 0
[07/26/2022-08:44:01] [I] DLACore:
[07/26/2022-08:44:01] [I] Plugins:
[07/26/2022-08:44:01] [I] === Inference Options ===
[07/26/2022-08:44:01] [I] Batch: Explicit
[07/26/2022-08:44:01] [I] Input inference shape: x2=4x512
[07/26/2022-08:44:01] [I] Input inference shape: x1=4x3x224x224
[07/26/2022-08:44:01] [I] Iterations: 10
[07/26/2022-08:44:01] [I] Duration: 3s (+ 200ms warm up)
[07/26/2022-08:44:01] [I] Sleep time: 0ms
[07/26/2022-08:44:01] [I] Idle time: 0ms
[07/26/2022-08:44:01] [I] Streams: 1
[07/26/2022-08:44:01] [I] ExposeDMA: Disabled
[07/26/2022-08:44:01] [I] Data transfers: Enabled
[07/26/2022-08:44:01] [I] Spin-wait: Disabled
[07/26/2022-08:44:01] [I] Multithreading: Disabled
[07/26/2022-08:44:01] [I] CUDA Graph: Disabled
[07/26/2022-08:44:01] [I] Separate profiling: Disabled
[07/26/2022-08:44:01] [I] Time Deserialize: Disabled
[07/26/2022-08:44:01] [I] Time Refit: Disabled
[07/26/2022-08:44:01] [I] Skip inference: Disabled
[07/26/2022-08:44:01] [I] Inputs:
[07/26/2022-08:44:01] [I] === Reporting Options ===
[07/26/2022-08:44:01] [I] Verbose: Disabled
[07/26/2022-08:44:01] [I] Averages: 10 inferences
[07/26/2022-08:44:01] [I] Percentile: 99
[07/26/2022-08:44:01] [I] Dump refittable layers:Disabled
[07/26/2022-08:44:01] [I] Dump output: Disabled
[07/26/2022-08:44:01] [I] Profile: Disabled
[07/26/2022-08:44:01] [I] Export timing to JSON file:
[07/26/2022-08:44:01] [I] Export output to JSON file:
[07/26/2022-08:44:01] [I] Export profile to JSON file:
[07/26/2022-08:44:01] [I]
[07/26/2022-08:44:01] [I] === Device Information ===
[07/26/2022-08:44:01] [I] Selected Device: NVIDIA GeForce RTX 2080 Ti
[07/26/2022-08:44:01] [I] Compute Capability: 7.5
[07/26/2022-08:44:01] [I] SMs: 68
[07/26/2022-08:44:01] [I] Compute Clock Rate: 1.65 GHz
[07/26/2022-08:44:01] [I] Device Global Memory: 11011 MiB
[07/26/2022-08:44:01] [I] Shared Memory per SM: 64 KiB
[07/26/2022-08:44:01] [I] Memory Bus Width: 352 bits (ECC disabled)
[07/26/2022-08:44:01] [I] Memory Clock Rate: 7 GHz
[07/26/2022-08:44:01] [I]
[07/26/2022-08:44:01] [I] TensorRT version: 8.2.4
[07/26/2022-08:44:01] [I] [TRT] [MemUsageChange] Init CUDA: CPU +321, GPU +0, now: CPU 879, GPU 870 (MiB)
[07/26/2022-08:44:01] [I] [TRT] Loaded engine size: 545 MiB
[07/26/2022-08:44:02] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +513, GPU +222, now: CPU 1431, GPU 1604 (MiB)
[07/26/2022-08:44:02] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +116, GPU +54, now: CPU 1547, GPU 1658 (MiB)
[07/26/2022-08:44:02] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +508, now: CPU 0, GPU 508 (MiB)
[07/26/2022-08:44:02] [I] Engine loaded in 0.983595 sec.
[07/26/2022-08:44:02] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 1001, GPU 1650 (MiB)
[07/26/2022-08:44:02] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1001, GPU 1658 (MiB)
[07/26/2022-08:44:02] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +627, now: CPU 0, GPU 1135 (MiB)
[07/26/2022-08:44:02] [I] Using random values for input x1
[07/26/2022-08:44:02] [I] Created input binding for x1 with dimensions 4x3x224x224
[07/26/2022-08:44:02] [I] Using random values for input x2
[07/26/2022-08:44:02] [I] Created input binding for x2 with dimensions 4x512
[07/26/2022-08:44:02] [I] Using random values for output outputs
[07/26/2022-08:44:02] [I] Created output binding for outputs with dimensions 4x3x224x224
[07/26/2022-08:44:02] [I] Starting inference
[07/26/2022-08:44:05] [I] Warmup completed 1 queries over 200 ms
[07/26/2022-08:44:05] [I] Timing trace has 93 queries over 2.88673 s
[07/26/2022-08:44:05] [I]
[07/26/2022-08:44:05] [I] === Trace details ===
[07/26/2022-08:44:05] [I] Trace averages of 10 runs:
[07/26/2022-08:44:05] [I] Average on 10 runs - GPU latency: 34.3636 ms - Host latency: 34.9215 ms (end to end 65.277 ms, enqueue 2.94725 ms)
[07/26/2022-08:44:05] [I] Average on 10 runs - GPU latency: 30.1534 ms - Host latency: 30.6778 ms (end to end 59.8989 ms, enqueue 3.40982 ms)
[07/26/2022-08:44:05] [I] Average on 10 runs - GPU latency: 30.2245 ms - Host latency: 30.7502 ms (end to end 60.2467 ms, enqueue 3.50318 ms)
[07/26/2022-08:44:05] [I] Average on 10 runs - GPU latency: 30.2769 ms - Host latency: 30.8062 ms (end to end 59.5009 ms, enqueue 3.50552 ms)
[07/26/2022-08:44:05] [I] Average on 10 runs - GPU latency: 30.4659 ms - Host latency: 30.9747 ms (end to end 60.7813 ms, enqueue 2.64545 ms)
[07/26/2022-08:44:05] [I] Average on 10 runs - GPU latency: 30.508 ms - Host latency: 31.0339 ms (end to end 60.8048 ms, enqueue 3.31053 ms)
[07/26/2022-08:44:05] [I] Average on 10 runs - GPU latency: 32.119 ms - Host latency: 32.6373 ms (end to end 64.0943 ms, enqueue 1.74031 ms)
[07/26/2022-08:44:05] [I] Average on 10 runs - GPU latency: 30.4284 ms - Host latency: 30.949 ms (end to end 60.6614 ms, enqueue 2.69846 ms)
[07/26/2022-08:44:05] [I] Average on 10 runs - GPU latency: 30.6439 ms - Host latency: 31.1798 ms (end to end 61.0422 ms, enqueue 3.30291 ms)
[07/26/2022-08:44:05] [I]
[07/26/2022-08:44:05] [I] === Performance summary ===
[07/26/2022-08:44:05] [I] Throughput: 32.2164 qps
[07/26/2022-08:44:05] [I] Latency: min = 30.0452 ms, max = 38.9216 ms, mean = 31.5264 ms, median = 30.9492 ms, percentile(99%) = 38.9216 ms
[07/26/2022-08:44:05] [I] End-to-End Host Latency: min = 42.0844 ms, max = 76.3016 ms, mean = 61.3436 ms, median = 60.593 ms, percentile(99%) = 76.3016 ms
[07/26/2022-08:44:05] [I] Enqueue Time: min = 0.68335 ms, max = 4.70483 ms, mean = 3.02451 ms, median = 3.21851 ms, percentile(99%) = 4.70483 ms
[07/26/2022-08:44:05] [I] H2D Latency: min = 0.242432 ms, max = 0.47583 ms, mean = 0.313639 ms, median = 0.308838 ms, percentile(99%) = 0.47583 ms
[07/26/2022-08:44:05] [I] GPU Compute Time: min = 29.5245 ms, max = 38.4528 ms, mean = 30.999 ms, median = 30.438 ms, percentile(99%) = 38.4528 ms
[07/26/2022-08:44:05] [I] D2H Latency: min = 0.186035 ms, max = 0.218506 ms, mean = 0.213687 ms, median = 0.214111 ms, percentile(99%) = 0.218506 ms
[07/26/2022-08:44:05] [I] Total Host Walltime: 2.88673 s
[07/26/2022-08:44:05] [I] Total GPU Compute Time: 2.88291 s
[07/26/2022-08:44:05] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/26/2022-08:44:05] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8204] # /workspace/tensorrt/bin/trtexec --loadEngine=dynamic_batch.plan --shapes=x1:4x3x224x224,x2:4x512