TensorRT 8.6.1 trtexec results:
[07/04/2023-15:52:15] [I] === Model Options ===
[07/04/2023-15:52:15] [I] Format: ONNX
[07/04/2023-15:52:15] [I] Model: test.onnx
[07/04/2023-15:52:15] [I] Output:
[07/04/2023-15:52:15] [I] === Build Options ===
[07/04/2023-15:52:15] [I] Max batch: explicit batch
[07/04/2023-15:52:15] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[07/04/2023-15:52:15] [I] minTiming: 1
[07/04/2023-15:52:15] [I] avgTiming: 8
[07/04/2023-15:52:15] [I] Precision: FP32
[07/04/2023-15:52:15] [I] LayerPrecisions:
[07/04/2023-15:52:15] [I] Layer Device Types:
[07/04/2023-15:52:15] [I] Calibration:
[07/04/2023-15:52:15] [I] Refit: Disabled
[07/04/2023-15:52:15] [I] Version Compatible: Disabled
[07/04/2023-15:52:15] [I] TensorRT runtime: full
[07/04/2023-15:52:15] [I] Lean DLL Path:
[07/04/2023-15:52:15] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[07/04/2023-15:52:15] [I] Exclude Lean Runtime: Disabled
[07/04/2023-15:52:15] [I] Sparsity: Disabled
[07/04/2023-15:52:15] [I] Safe mode: Disabled
[07/04/2023-15:52:15] [I] Build DLA standalone loadable: Disabled
[07/04/2023-15:52:15] [I] Allow GPU fallback for DLA: Disabled
[07/04/2023-15:52:15] [I] DirectIO mode: Disabled
[07/04/2023-15:52:15] [I] Restricted mode: Disabled
[07/04/2023-15:52:15] [I] Skip inference: Disabled
[07/04/2023-15:52:15] [I] Save engine: test.trt
[07/04/2023-15:52:15] [I] Load engine:
[07/04/2023-15:52:15] [I] Profiling verbosity: 0
[07/04/2023-15:52:15] [I] Tactic sources: Using default tactic sources
[07/04/2023-15:52:15] [I] timingCacheMode: local
[07/04/2023-15:52:15] [I] timingCacheFile:
[07/04/2023-15:52:15] [I] Heuristic: Disabled
[07/04/2023-15:52:15] [I] Preview Features: Use default preview flags.
[07/04/2023-15:52:15] [I] MaxAuxStreams: -1
[07/04/2023-15:52:15] [I] BuilderOptimizationLevel: -1
[07/04/2023-15:52:15] [I] Input(s)s format: fp32:CHW
[07/04/2023-15:52:15] [I] Output(s)s format: fp32:CHW
[07/04/2023-15:52:15] [I] Input build shapes: model
[07/04/2023-15:52:15] [I] Input calibration shapes: model
[07/04/2023-15:52:15] [I] === System Options ===
[07/04/2023-15:52:15] [I] Device: 0
[07/04/2023-15:52:15] [I] DLACore:
[07/04/2023-15:52:15] [I] Plugins:
[07/04/2023-15:52:15] [I] setPluginsToSerialize:
[07/04/2023-15:52:15] [I] dynamicPlugins:
[07/04/2023-15:52:15] [I] ignoreParsedPluginLibs: 0
[07/04/2023-15:52:15] [I]
[07/04/2023-15:52:15] [I] === Inference Options ===
[07/04/2023-15:52:15] [I] Batch: Explicit
[07/04/2023-15:52:15] [I] Input inference shapes: model
[07/04/2023-15:52:15] [I] Iterations: 10
[07/04/2023-15:52:15] [I] Duration: 3s (+ 200ms warm up)
[07/04/2023-15:52:15] [I] Sleep time: 0ms
[07/04/2023-15:52:15] [I] Idle time: 0ms
[07/04/2023-15:52:15] [I] Inference Streams: 1
[07/04/2023-15:52:15] [I] ExposeDMA: Disabled
[07/04/2023-15:52:15] [I] Data transfers: Enabled
[07/04/2023-15:52:15] [I] Spin-wait: Disabled
[07/04/2023-15:52:15] [I] Multithreading: Disabled
[07/04/2023-15:52:15] [I] CUDA Graph: Disabled
[07/04/2023-15:52:15] [I] Separate profiling: Disabled
[07/04/2023-15:52:15] [I] Time Deserialize: Disabled
[07/04/2023-15:52:15] [I] Time Refit: Disabled
[07/04/2023-15:52:15] [I] NVTX verbosity: 0
[07/04/2023-15:52:15] [I] Persistent Cache Ratio: 0
[07/04/2023-15:52:15] [I] Inputs:
[07/04/2023-15:52:15] [I] === Reporting Options ===
[07/04/2023-15:52:15] [I] Verbose: Disabled
[07/04/2023-15:52:15] [I] Averages: 10 inferences
[07/04/2023-15:52:15] [I] Percentiles: 90,95,99
[07/04/2023-15:52:15] [I] Dump refittable layers:Disabled
[07/04/2023-15:52:15] [I] Dump output: Disabled
[07/04/2023-15:52:15] [I] Profile: Disabled
[07/04/2023-15:52:15] [I] Export timing to JSON file:
[07/04/2023-15:52:15] [I] Export output to JSON file:
[07/04/2023-15:52:15] [I] Export profile to JSON file:
[07/04/2023-15:52:15] [I]
[07/04/2023-15:52:15] [I] === Device Information ===
[07/04/2023-15:52:15] [I] Selected Device: NVIDIA GeForce RTX 3060 Laptop GPU
[07/04/2023-15:52:15] [I] Compute Capability: 8.6
[07/04/2023-15:52:15] [I] SMs: 30
[07/04/2023-15:52:15] [I] Device Global Memory: 6143 MiB
[07/04/2023-15:52:15] [I] Shared Memory per SM: 100 KiB
[07/04/2023-15:52:15] [I] Memory Bus Width: 192 bits (ECC disabled)
[07/04/2023-15:52:15] [I] Application Compute Clock Rate: 1.702 GHz
[07/04/2023-15:52:15] [I] Application Memory Clock Rate: 7.001 GHz
[07/04/2023-15:52:15] [I]
[07/04/2023-15:52:15] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[07/04/2023-15:52:15] [I]
[07/04/2023-15:52:15] [I] TensorRT version: 8.6.1
[07/04/2023-15:52:15] [I] Loading standard plugins
[07/04/2023-15:52:15] [I] [TRT] [MemUsageChange] Init CUDA: CPU +284, GPU +0, now: CPU 12241, GPU 1072 (MiB)
[07/04/2023-15:52:19] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +1162, GPU +264, now: CPU 14543, GPU 1336 (MiB)
[07/04/2023-15:52:19] [W] [TRT] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See “Lazy Loading” section of CUDA documentation CUDA C++ Programming Guide
[07/04/2023-15:52:19] [I] Start parsing network model.
[07/04/2023-15:52:19] [I] [TRT] ----------------------------------------------------------------
[07/04/2023-15:52:19] [I] [TRT] Input filename: test.onnx
[07/04/2023-15:52:19] [I] [TRT] ONNX IR version: 0.0.6
[07/04/2023-15:52:19] [I] [TRT] Opset version: 11
[07/04/2023-15:52:19] [I] [TRT] Producer name: pytorch
[07/04/2023-15:52:19] [I] [TRT] Producer version: 1.9
[07/04/2023-15:52:19] [I] [TRT] Domain:
[07/04/2023-15:52:19] [I] [TRT] Model version: 0
[07/04/2023-15:52:19] [I] [TRT] Doc string:
[07/04/2023-15:52:19] [I] [TRT] ----------------------------------------------------------------
[07/04/2023-15:52:19] [I] Finished parsing network model. Parse time: 0.110906
[07/04/2023-15:52:19] [I] [TRT] Graph optimization time: 0.0062036 seconds.
[07/04/2023-15:52:19] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[07/04/2023-15:52:45] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[07/04/2023-15:52:45] [I] [TRT] Total Host Persistent Memory: 262480
[07/04/2023-15:52:45] [I] [TRT] Total Device Persistent Memory: 5120
[07/04/2023-15:52:45] [I] [TRT] Total Scratch Memory: 8652800
[07/04/2023-15:52:45] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 25 MiB, GPU 116 MiB
[07/04/2023-15:52:45] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 115 steps to complete.
[07/04/2023-15:52:45] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 6.9475ms to assign 10 blocks to 115 nodes requiring 45090304 bytes.
[07/04/2023-15:52:45] [I] [TRT] Total Activation Memory: 45088768
[07/04/2023-15:52:45] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +116, now: CPU 0, GPU 116 (MiB)
[07/04/2023-15:52:46] [I] Engine built in 30.9293 sec.
[07/04/2023-15:52:46] [I] [TRT] Loaded engine size: 117 MiB
[07/04/2023-15:52:46] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +115, now: CPU 0, GPU 115 (MiB)
[07/04/2023-15:52:46] [I] Engine deserialized in 0.0321676 sec.
[07/04/2023-15:52:46] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +43, now: CPU 0, GPU 158 (MiB)
[07/04/2023-15:52:46] [W] [TRT] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See “Lazy Loading” section of CUDA documentation CUDA C++ Programming Guide
[07/04/2023-15:52:46] [I] Setting persistentCacheLimit to 0 bytes.
[07/04/2023-15:52:46] [I] Using random values for input input
[07/04/2023-15:52:46] [I] Input binding for input with dimensions 1x3x512x512 is created.
[07/04/2023-15:52:46] [I] Output binding for output with dimensions 1x64x128x128 is created.
[07/04/2023-15:52:46] [I] Starting inference
[07/04/2023-15:52:49] [I] Warmup completed 6 queries over 200 ms
[07/04/2023-15:52:49] [I] Timing trace has 89 queries over 3.0655 s
[07/04/2023-15:52:49] [I]
[07/04/2023-15:52:49] [I] === Trace details ===
[07/04/2023-15:52:49] [I] Trace averages of 10 runs:
[07/04/2023-15:52:49] [I] Average on 10 runs - GPU latency: 33.0017 ms - Host latency: 33.5733 ms (enqueue 33.4361 ms)
[07/04/2023-15:52:49] [I] Average on 10 runs - GPU latency: 34.3227 ms - Host latency: 34.8911 ms (enqueue 34.0019 ms)
[07/04/2023-15:52:49] [I] Average on 10 runs - GPU latency: 35.0873 ms - Host latency: 35.6554 ms (enqueue 35.6215 ms)
[07/04/2023-15:52:49] [I] Average on 10 runs - GPU latency: 33.5733 ms - Host latency: 34.1425 ms (enqueue 34.2074 ms)
[07/04/2023-15:52:49] [I] Average on 10 runs - GPU latency: 32.4652 ms - Host latency: 33.0368 ms (enqueue 32.9492 ms)
[07/04/2023-15:52:49] [I] Average on 10 runs - GPU latency: 34.9549 ms - Host latency: 35.5234 ms (enqueue 35.1269 ms)
[07/04/2023-15:52:49] [I] Average on 10 runs - GPU latency: 35.5306 ms - Host latency: 36.0989 ms (enqueue 36.6388 ms)
[07/04/2023-15:52:49] [I] Average on 10 runs - GPU latency: 32.14 ms - Host latency: 32.709 ms (enqueue 32.3849 ms)
[07/04/2023-15:52:49] [I]
[07/04/2023-15:52:49] [I] === Performance summary ===
[07/04/2023-15:52:49] [I] Throughput: 29.0328 qps
[07/04/2023-15:52:49] [I] Latency: min = 29.62 ms, max = 40.5161 ms, mean = 34.3186 ms, median = 34.5432 ms, percentile(90%) = 37.9171 ms, percentile(95%) = 38.6229 ms, percentile(99%) = 40.5161 ms
[07/04/2023-15:52:49] [I] Enqueue Time: min = 22.5655 ms, max = 44.9667 ms, mean = 34.2031 ms, median = 33.9287 ms, percentile(90%) = 39.4993 ms, percentile(95%) = 40.7681 ms, percentile(99%) = 44.9667 ms
[07/04/2023-15:52:49] [I] H2D Latency: min = 0.245361 ms, max = 0.272644 ms, mean = 0.246794 ms, median = 0.246399 ms, percentile(90%) = 0.24707 ms, percentile(95%) = 0.247925 ms, percentile(99%) = 0.272644 ms
[07/04/2023-15:52:49] [I] GPU Compute Time: min = 29.0519 ms, max = 39.9473 ms, mean = 33.7494 ms, median = 33.9743 ms, percentile(90%) = 37.3484 ms, percentile(95%) = 38.0539 ms, percentile(99%) = 39.9473 ms
[07/04/2023-15:52:49] [I] D2H Latency: min = 0.321533 ms, max = 0.346924 ms, mean = 0.322436 ms, median = 0.321777 ms, percentile(90%) = 0.322266 ms, percentile(95%) = 0.325439 ms, percentile(99%) = 0.346924 ms
[07/04/2023-15:52:49] [I] Total Host Walltime: 3.0655 s
[07/04/2023-15:52:49] [I] Total GPU Compute Time: 3.00369 s
[07/04/2023-15:52:49] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[07/04/2023-15:52:49] [W] If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[07/04/2023-15:52:49] [W] * GPU compute time is unstable, with coefficient of variance = 8.9478%.
[07/04/2023-15:52:49] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/04/2023-15:52:49] [I] Explanations of the performance metrics are printed in the verbose logs.
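The run above ends with two warnings: throughput may be enqueue-bound, and GPU compute time is unstable. A possible re-run incorporating the fixes trtexec itself suggests, plus CUDA lazy loading for the repeated `[W]` about it, might look like this (a sketch only; `test.onnx`/`test.trt` are the paths from the log, and the flags may need tuning for your setup):

```shell
# Enable CUDA lazy loading to address the repeated [W] about it
# (honored by CUDA 11.7+; can reduce init time and device memory usage)
export CUDA_MODULE_LOADING=LAZY

# Re-run with the flags the warnings recommend:
#   --useCudaGraph : capture enqueue work into a CUDA graph (enqueue-bound warning)
#   --useSpinWait  : spin-wait on completion for more stable timing measurements
trtexec --onnx=test.onnx --saveEngine=test.trt --useCudaGraph --useSpinWait
```

Locking the GPU clocks (e.g. via `nvidia-smi -lgc`) is the other stabilization the warning mentions, but it requires admin rights on most systems.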
TensorRT 8.2.1 trtexec results:
[07/04/2023-16:00:00] [I] === Model Options ===
[07/04/2023-16:00:00] [I] Format: ONNX
[07/04/2023-16:00:00] [I] Model: test.onnx
[07/04/2023-16:00:00] [I] Output:
[07/04/2023-16:00:00] [I] === Build Options ===
[07/04/2023-16:00:00] [I] Max batch: explicit batch
[07/04/2023-16:00:00] [I] Workspace: 16 MiB
[07/04/2023-16:00:00] [I] minTiming: 1
[07/04/2023-16:00:00] [I] avgTiming: 8
[07/04/2023-16:00:00] [I] Precision: FP32
[07/04/2023-16:00:00] [I] Calibration:
[07/04/2023-16:00:00] [I] Refit: Disabled
[07/04/2023-16:00:00] [I] Sparsity: Disabled
[07/04/2023-16:00:00] [I] Safe mode: Disabled
[07/04/2023-16:00:00] [I] DirectIO mode: Disabled
[07/04/2023-16:00:00] [I] Restricted mode: Disabled
[07/04/2023-16:00:00] [I] Save engine: test.trt
[07/04/2023-16:00:00] [I] Load engine:
[07/04/2023-16:00:00] [I] Profiling verbosity: 0
[07/04/2023-16:00:00] [I] Tactic sources: Using default tactic sources
[07/04/2023-16:00:00] [I] timingCacheMode: local
[07/04/2023-16:00:00] [I] timingCacheFile:
[07/04/2023-16:00:00] [I] Input(s)s format: fp32:CHW
[07/04/2023-16:00:00] [I] Output(s)s format: fp32:CHW
[07/04/2023-16:00:00] [I] Input build shapes: model
[07/04/2023-16:00:00] [I] Input calibration shapes: model
[07/04/2023-16:00:00] [I] === System Options ===
[07/04/2023-16:00:00] [I] Device: 0
[07/04/2023-16:00:00] [I] DLACore:
[07/04/2023-16:00:00] [I] Plugins:
[07/04/2023-16:00:00] [I] === Inference Options ===
[07/04/2023-16:00:00] [I] Batch: Explicit
[07/04/2023-16:00:00] [I] Input inference shapes: model
[07/04/2023-16:00:00] [I] Iterations: 10
[07/04/2023-16:00:00] [I] Duration: 3s (+ 200ms warm up)
[07/04/2023-16:00:00] [I] Sleep time: 0ms
[07/04/2023-16:00:00] [I] Idle time: 0ms
[07/04/2023-16:00:00] [I] Streams: 1
[07/04/2023-16:00:00] [I] ExposeDMA: Disabled
[07/04/2023-16:00:00] [I] Data transfers: Enabled
[07/04/2023-16:00:00] [I] Spin-wait: Disabled
[07/04/2023-16:00:00] [I] Multithreading: Disabled
[07/04/2023-16:00:00] [I] CUDA Graph: Disabled
[07/04/2023-16:00:00] [I] Separate profiling: Disabled
[07/04/2023-16:00:00] [I] Time Deserialize: Disabled
[07/04/2023-16:00:00] [I] Time Refit: Disabled
[07/04/2023-16:00:00] [I] Skip inference: Disabled
[07/04/2023-16:00:00] [I] Inputs:
[07/04/2023-16:00:00] [I] === Reporting Options ===
[07/04/2023-16:00:00] [I] Verbose: Disabled
[07/04/2023-16:00:00] [I] Averages: 10 inferences
[07/04/2023-16:00:00] [I] Percentile: 99
[07/04/2023-16:00:00] [I] Dump refittable layers:Disabled
[07/04/2023-16:00:00] [I] Dump output: Disabled
[07/04/2023-16:00:00] [I] Profile: Disabled
[07/04/2023-16:00:00] [I] Export timing to JSON file:
[07/04/2023-16:00:00] [I] Export output to JSON file:
[07/04/2023-16:00:00] [I] Export profile to JSON file:
[07/04/2023-16:00:00] [I]
[07/04/2023-16:00:00] [I] === Device Information ===
[07/04/2023-16:00:00] [I] Selected Device: NVIDIA GeForce RTX 3060 Laptop GPU
[07/04/2023-16:00:00] [I] Compute Capability: 8.6
[07/04/2023-16:00:00] [I] SMs: 30
[07/04/2023-16:00:00] [I] Compute Clock Rate: 1.702 GHz
[07/04/2023-16:00:00] [I] Device Global Memory: 6143 MiB
[07/04/2023-16:00:00] [I] Shared Memory per SM: 100 KiB
[07/04/2023-16:00:00] [I] Memory Bus Width: 192 bits (ECC disabled)
[07/04/2023-16:00:00] [I] Memory Clock Rate: 7.001 GHz
[07/04/2023-16:00:00] [I]
[07/04/2023-16:00:00] [I] TensorRT version: 8.2.1
[07/04/2023-16:00:00] [I] [TRT] [MemUsageChange] Init CUDA: CPU +277, GPU +0, now: CPU 10622, GPU 1072 (MiB)
[07/04/2023-16:00:11] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +1199, GPU +264, now: CPU 12819, GPU 1336 (MiB)
[07/04/2023-16:00:11] [W] [TRT] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See “Lazy Loading” section of CUDA documentation CUDA C++ Programming Guide
[07/04/2023-16:00:11] [I] Start parsing network model
[07/04/2023-16:00:11] [I] [TRT] ----------------------------------------------------------------
[07/04/2023-16:00:11] [I] [TRT] Input filename: test.onnx
[07/04/2023-16:00:11] [I] [TRT] ONNX IR version: 0.0.6
[07/04/2023-16:00:11] [I] [TRT] Opset version: 11
[07/04/2023-16:00:11] [I] [TRT] Producer name: pytorch
[07/04/2023-16:00:11] [I] [TRT] Producer version: 1.9
[07/04/2023-16:00:11] [I] [TRT] Domain:
[07/04/2023-16:00:11] [I] [TRT] Model version: 0
[07/04/2023-16:00:11] [I] [TRT] Doc string:
[07/04/2023-16:00:11] [I] [TRT] ----------------------------------------------------------------
[07/04/2023-16:00:11] [I] Finish parsing network model
[07/04/2023-16:00:11] [I] [TRT] Graph optimization time: 0.0065841 seconds.
[07/04/2023-16:00:11] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[07/04/2023-16:00:33] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[07/04/2023-16:00:33] [I] [TRT] Total Host Persistent Memory: 260208
[07/04/2023-16:00:33] [I] [TRT] Total Device Persistent Memory: 236032
[07/04/2023-16:00:33] [I] [TRT] Total Scratch Memory: 2098176
[07/04/2023-16:00:33] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 25 MiB, GPU 119 MiB
[07/04/2023-16:00:33] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 110 steps to complete.
[07/04/2023-16:00:33] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 6.444ms to assign 8 blocks to 110 nodes requiring 45089280 bytes.
[07/04/2023-16:00:33] [I] [TRT] Total Activation Memory: 45088768
[07/04/2023-16:00:33] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +115, now: CPU 0, GPU 115 (MiB)
[07/04/2023-16:00:33] [I] [TRT] Loaded engine size: 117 MiB
[07/04/2023-16:00:33] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +114, now: CPU 0, GPU 114 (MiB)
[07/04/2023-16:00:33] [E] Error[3]: [runtime.cpp::nvinfer1::Runtime::~Runtime::346] Error Code 3: API Usage Error (Parameter check failed at: runtime.cpp::nvinfer1::Runtime::~Runtime::346, condition: mEngineCounter.use_count() == 1. Destroying a runtime before destroying deserialized engines created by the runtime leads to undefined behavior.
)
[07/04/2023-16:00:33] [E] Error[3]: [builder.cpp::nvinfer1::builder::Builder::~Builder::341] Error Code 3: API Usage Error (Parameter check failed at: builder.cpp::nvinfer1::builder::Builder::~Builder::341, condition: mObjectCounter.use_count() == 1. Destroying a builder object before destroying objects it created leads to undefined behavior.
)
[07/04/2023-16:00:33] [I] Engine built in 33.3251 sec.
[07/04/2023-16:00:33] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +43, now: CPU 0, GPU 157 (MiB)
[07/04/2023-16:00:33] [W] [TRT] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See “Lazy Loading” section of CUDA documentation CUDA C++ Programming Guide
[07/04/2023-16:00:33] [I] Using random values for input input
[07/04/2023-16:00:33] [I] Created input binding for input with dimensions 1x3x512x512
[07/04/2023-16:00:34] [I] Using random values for output output
[07/04/2023-16:00:34] [I] Created output binding for output with dimensions 1x64x128x128
[07/04/2023-16:00:34] [I] Starting inference
[07/04/2023-16:00:37] [I] Warmup completed 10 queries over 200 ms
[07/04/2023-16:00:37] [I] Timing trace has 148 queries over 3.03358 s
[07/04/2023-16:00:37] [I]
[07/04/2023-16:00:37] [I] === Trace details ===
[07/04/2023-16:00:37] [I] Trace averages of 10 runs:
[07/04/2023-16:00:37] [I] Average on 10 runs - GPU latency: 21.458 ms - Host latency: 22.0232 ms (end to end 22.1509 ms, enqueue 21.1855 ms)
[07/04/2023-16:00:37] [I] Average on 10 runs - GPU latency: 20.8364 ms - Host latency: 21.4018 ms (end to end 21.5103 ms, enqueue 20.0202 ms)
[07/04/2023-16:00:37] [I] Average on 10 runs - GPU latency: 19.4282 ms - Host latency: 19.9976 ms (end to end 20.1049 ms, enqueue 19.8673 ms)
[07/04/2023-16:00:37] [I] Average on 10 runs - GPU latency: 19.3706 ms - Host latency: 19.9353 ms (end to end 20.0339 ms, enqueue 18.2692 ms)
[07/04/2023-16:00:37] [I] Average on 10 runs - GPU latency: 21.314 ms - Host latency: 21.8795 ms (end to end 22.0339 ms, enqueue 20.7183 ms)
[07/04/2023-16:00:37] [I] Average on 10 runs - GPU latency: 19.4384 ms - Host latency: 20.0034 ms (end to end 20.101 ms, enqueue 19.0662 ms)
[07/04/2023-16:00:37] [I] Average on 10 runs - GPU latency: 18.748 ms - Host latency: 19.3135 ms (end to end 19.4005 ms, enqueue 19.0619 ms)
[07/04/2023-16:00:37] [I] Average on 10 runs - GPU latency: 18.4604 ms - Host latency: 19.0252 ms (end to end 19.11 ms, enqueue 16.8491 ms)
[07/04/2023-16:00:37] [I] Average on 10 runs - GPU latency: 18.3616 ms - Host latency: 18.9261 ms (end to end 19.0176 ms, enqueue 18.4146 ms)
[07/04/2023-16:00:37] [I] Average on 10 runs - GPU latency: 20.4157 ms - Host latency: 20.9807 ms (end to end 21.1246 ms, enqueue 19.4072 ms)
[07/04/2023-16:00:37] [I] Average on 10 runs - GPU latency: 19.2681 ms - Host latency: 19.8331 ms (end to end 19.9717 ms, enqueue 18.7269 ms)
[07/04/2023-16:00:37] [I] Average on 10 runs - GPU latency: 20.3905 ms - Host latency: 20.9555 ms (end to end 21.0898 ms, enqueue 20.3416 ms)
[07/04/2023-16:00:37] [I] Average on 10 runs - GPU latency: 20.8597 ms - Host latency: 21.425 ms (end to end 21.5364 ms, enqueue 21.4828 ms)
[07/04/2023-16:00:37] [I] Average on 10 runs - GPU latency: 19.6132 ms - Host latency: 20.1783 ms (end to end 20.2763 ms, enqueue 19.0764 ms)
[07/04/2023-16:00:37] [I]
[07/04/2023-16:00:37] [I] === Performance summary ===
[07/04/2023-16:00:37] [I] Throughput: 48.7872 qps
[07/04/2023-16:00:37] [I] Latency: min = 16.9426 ms, max = 25.7742 ms, mean = 20.3844 ms, median = 21.1941 ms, percentile(99%) = 25.6042 ms
[07/04/2023-16:00:37] [I] End-to-End Host Latency: min = 17.0652 ms, max = 25.8479 ms, mean = 20.4956 ms, median = 21.315 ms, percentile(99%) = 25.7107 ms
[07/04/2023-16:00:37] [I] Enqueue Time: min = 9.98462 ms, max = 27.6577 ms, mean = 19.4512 ms, median = 19.9514 ms, percentile(99%) = 24.9242 ms
[07/04/2023-16:00:37] [I] H2D Latency: min = 0.243652 ms, max = 0.251709 ms, mean = 0.24467 ms, median = 0.244629 ms, percentile(99%) = 0.245728 ms
[07/04/2023-16:00:37] [I] GPU Compute Time: min = 16.3779 ms, max = 25.2097 ms, mean = 19.819 ms, median = 20.6289 ms, percentile(99%) = 25.0398 ms
[07/04/2023-16:00:37] [I] D2H Latency: min = 0.320068 ms, max = 0.361877 ms, mean = 0.32066 ms, median = 0.320313 ms, percentile(99%) = 0.32251 ms
[07/04/2023-16:00:37] [I] Total Host Walltime: 3.03358 s
[07/04/2023-16:00:37] [I] Total GPU Compute Time: 2.93322 s
[07/04/2023-16:00:37] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[07/04/2023-16:00:37] [W] If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[07/04/2023-16:00:37] [I] Explanations of the performance metrics are printed in the verbose logs.
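Comparing the two performance summaries directly, the throughput numbers reported above (29.0328 qps on 8.6.1 vs 48.7872 qps on 8.2.1) can be turned into a speedup ratio with a quick one-liner:

```shell
# Throughput values copied from the two "=== Performance summary ===" sections
tp_861=29.0328
tp_821=48.7872
awk -v a="$tp_861" -v b="$tp_821" \
    'BEGIN { printf "8.2.1 delivers %.2fx the throughput of 8.6.1\n", b/a }'
# prints: 8.2.1 delivers 1.68x the throughput of 8.6.1
```

The mean GPU compute times tell the same story (33.7494 ms vs 19.819 ms, about 1.70x), so the regression is in the kernels the 8.6.1 builder selected, not in host-side overhead alone.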