What kind of hardware rig can support 100+ video analytics streams using DeepStream?

Hi @gabe_ddi,
Sorry for the delay!
You can refer to the TensorRT sample samples/sampleINT8. Reference steps:
$ cd tensorrt/data/mnist/   // download https://github.com/BVLC/caffe/blob/master/data/mnist/get_mnist.sh into this folder
$ ./get_mnist.sh            // download the images
$ cd tensorrt/samples/sampleINT8/
$ make
$ ../../bin/sample_int8 --int8   // this generates the INT8 calibration table

For your case, referring to this sample, note that:

  1. The model in this sample is MNIST; you need to replace it with your own model, which must be able to run at FP32 precision, since during INT8 calibration the model actually runs at FP32 precision.
  2. The current calibrator reads images as raw data; if you want to use JPG or PNG images for calibration, you need to use OpenCV or another library to read the images.
  3. You can change the calibration algorithm; as mentioned in https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#optimizing_int8_c, there are several calibration algorithms.
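To make point 3 concrete: conceptually, a calibrator's job is to pick an INT8 scale for each tensor from FP32 activation statistics, and different algorithms pick it differently. The sketch below is an illustration only, not the actual TensorRT API; it contrasts a max-based scale with a percentile-based one (function names are mine, not TensorRT's):

```python
# Illustration only: how a calibration algorithm might pick the INT8 scale
# for one tensor from its FP32 activations. TensorRT's real calibrators
# (e.g. IInt8EntropyCalibrator2) implement this internally.

def minmax_scale(activations):
    """MinMax-style calibration: scale from the absolute maximum value."""
    amax = max(abs(v) for v in activations)
    return amax / 127.0

def percentile_scale(activations, pct=99.9):
    """Percentile-style calibration: clip rare outliers before scaling."""
    vals = sorted(abs(v) for v in activations)
    idx = min(len(vals) - 1, int(len(vals) * pct / 100.0))
    return vals[idx] / 127.0

def quantize_dequantize(v, scale):
    """Round-trip one value through INT8 with the chosen scale."""
    q = max(-128, min(127, round(v / scale)))
    return q * scale
```

With a single large outlier in the activations, the percentile scale keeps far more resolution for the bulk of the values, which is why the choice of calibration algorithm can noticeably affect INT8 accuracy.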

What about the case where you don’t use dynamic shape?

Thanks!

I tested YOLOv4 with batch=8, res=3x416x416 as below; its perf = (1000 ms / 29.0896 ms) * 8 = 275 fps.
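That throughput figure is just the median GPU compute latency converted to batches per second; a minimal sketch of the arithmetic:

```python
def batched_fps(latency_ms, batch_size):
    """Throughput implied by one batched inference:
    (batches per second) * (images per batch)."""
    return (1000.0 / latency_ms) * batch_size

# Median GPU compute latency from the trtexec run below: 29.0896 ms, batch=8
print(round(batched_fps(29.0896, 8)))  # -> 275
```

Note this counts inference only; end-to-end pipeline throughput also depends on host latency and data transfers, visible in the log below.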

// launch NGC docker
$ nvidia-docker run -it --net=host --ipc=host --publish 0.0.0.0:6006:6006 -v /home/$user/:/home/$user/ --rm nvcr.io/nvidia/pytorch:20.10-py3

// operate in docker
 # git clone https://github.com/Tianxiaomo/pytorch-YOLOv4.git
// download the .pth weights from
  * Baidu
    > yolov4.pth (https://pan.baidu.com/s/1ZroDvoGScDgtE1ja_QqJVw Extraction code: xrq9)
    > yolov4.conv.137.pth (https://pan.baidu.com/s/1ovBie4YyVQQoUrC3AY0joA Extraction code: kcel)
  * Google
    > yolov4.pth (https://drive.google.com/open?id=1wv_LiFeCRYwtpkqREPeI13-gPELBDwuJ)
    > yolov4.conv.137.pth (https://drive.google.com/open?id=1fcbR0bWzYfIEdLJPzOsn4R5mlvR6IQyA)
# python demo_pytorch2onnx.py yolov4.pth dog.jpg 8 80 416 416   // generates yolov4_8_3_416_416_static.onnx
# trtexec --onnx=yolov4_8_3_416_416_static.onnx --batch=8 --explicitBatch --fp16 --workspace=2048 --dumpProfile
[11/05/2020-22:48:40] [I] === Model Options ===
[11/05/2020-22:48:40] [I] Format: ONNX
[11/05/2020-22:48:40] [I] Model: yolov4_8_3_416_416_static.onnx
[11/05/2020-22:48:40] [I] Output:
[11/05/2020-22:48:40] [I] === Build Options ===
[11/05/2020-22:48:40] [I] Max batch: explicit
[11/05/2020-22:48:40] [I] Workspace: 2048 MiB
[11/05/2020-22:48:40] [I] minTiming: 1
[11/05/2020-22:48:40] [I] avgTiming: 8
[11/05/2020-22:48:40] [I] Precision: FP32+FP16
[11/05/2020-22:48:40] [I] Calibration:
[11/05/2020-22:48:40] [I] Refit: Disabled
[11/05/2020-22:48:40] [I] Safe mode: Disabled
[11/05/2020-22:48:40] [I] Save engine:
[11/05/2020-22:48:40] [I] Load engine:
[11/05/2020-22:48:40] [I] Builder Cache: Enabled
[11/05/2020-22:48:40] [I] NVTX verbosity: 0
[11/05/2020-22:48:40] [I] Tactic sources: Using default tactic sources
[11/05/2020-22:48:40] [I] Input(s)s format: fp32:CHW
[11/05/2020-22:48:40] [I] Output(s)s format: fp32:CHW
[11/05/2020-22:48:40] [I] Input build shapes: model
[11/05/2020-22:48:40] [I] Input calibration shapes: model
[11/05/2020-22:48:40] [I] === System Options ===
[11/05/2020-22:48:40] [I] Device: 0
[11/05/2020-22:48:40] [I] DLACore:
[11/05/2020-22:48:40] [I] Plugins:
[11/05/2020-22:48:40] [I] === Inference Options ===
[11/05/2020-22:48:40] [I] Batch: Explicit
[11/05/2020-22:48:40] [I] Input inference shapes: model
[11/05/2020-22:48:40] [I] Iterations: 10
[11/05/2020-22:48:40] [I] Duration: 3s (+ 200ms warm up)
[11/05/2020-22:48:40] [I] Sleep time: 0ms
[11/05/2020-22:48:40] [I] Streams: 1
[11/05/2020-22:48:40] [I] ExposeDMA: Disabled
[11/05/2020-22:48:40] [I] Data transfers: Enabled
[11/05/2020-22:48:40] [I] Spin-wait: Disabled
[11/05/2020-22:48:40] [I] Multithreading: Disabled
[11/05/2020-22:48:40] [I] CUDA Graph: Disabled
[11/05/2020-22:48:40] [I] Separate profiling: Disabled
[11/05/2020-22:48:40] [I] Skip inference: Disabled
[11/05/2020-22:48:40] [I] Inputs:
[11/05/2020-22:48:40] [I] === Reporting Options ===
[11/05/2020-22:48:40] [I] Verbose: Disabled
[11/05/2020-22:48:40] [I] Averages: 10 inferences
[11/05/2020-22:48:40] [I] Percentile: 99
[11/05/2020-22:48:40] [I] Dump refittable layers:Disabled
[11/05/2020-22:48:40] [I] Dump output: Disabled
[11/05/2020-22:48:40] [I] Profile: Enabled
[11/05/2020-22:48:40] [I] Export timing to JSON file:
[11/05/2020-22:48:40] [I] Export output to JSON file:
[11/05/2020-22:48:40] [I] Export profile to JSON file:
[11/05/2020-22:48:40] [I]
[11/05/2020-22:48:40] [I] === Device Information ===
[11/05/2020-22:48:40] [I] Selected Device: Tesla T4
[11/05/2020-22:48:40] [I] Compute Capability: 7.5
[11/05/2020-22:48:40] [I] SMs: 40
[11/05/2020-22:48:40] [I] Compute Clock Rate: 1.59 GHz
[11/05/2020-22:48:40] [I] Device Global Memory: 15109 MiB
[11/05/2020-22:48:40] [I] Shared Memory per SM: 64 KiB
[11/05/2020-22:48:40] [I] Memory Bus Width: 256 bits (ECC enabled)
[11/05/2020-22:48:40] [I] Memory Clock Rate: 5.001 GHz
[11/05/2020-22:48:40] [I]
----------------------------------------------------------------
Input filename:   yolov4_8_3_416_416_static.onnx
ONNX IR version:  0.0.6
Opset version:    11
Producer name:    pytorch
Producer version: 1.7
Domain:
Model version:    0
Doc string:
----------------------------------------------------------------
[11/05/2020-22:48:54] [W] [TRT] onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[11/05/2020-22:48:54] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[11/05/2020-22:48:54] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[11/05/2020-22:48:54] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[11/05/2020-22:48:54] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[11/05/2020-22:48:54] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[11/05/2020-22:48:54] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[11/05/2020-22:48:54] [W] [TRT] Output type must be INT32 for shape outputs
[11/05/2020-22:48:54] [W] [TRT] Output type must be INT32 for shape outputs
[11/05/2020-22:48:54] [W] [TRT] Output type must be INT32 for shape outputs
[11/05/2020-22:48:54] [W] [TRT] Output type must be INT32 for shape outputs
[11/05/2020-22:50:02] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[11/05/2020-22:57:23] [I] [TRT] Detected 1 inputs and 8 output network tensors.
[11/05/2020-22:57:24] [I] Engine built in 524.519 sec.
[11/05/2020-22:57:24] [I] Starting inference
[11/05/2020-22:57:27] [I] Warmup completed 0 queries over 200 ms
[11/05/2020-22:57:27] [I] Timing trace has 0 queries over 3.06719 s
[11/05/2020-22:57:27] [I] Trace averages of 10 runs:
[11/05/2020-22:57:27] [I] Average on 10 runs - GPU latency: 28.9643 ms - Host latency: 33.0101 ms (end to end 33.0231 ms, enqueue 30.4952 ms)
[11/05/2020-22:57:27] [I] Average on 10 runs - GPU latency: 28.8322 ms - Host latency: 32.8459 ms (end to end 32.8601 ms, enqueue 30.3493 ms)
[11/05/2020-22:57:27] [I] Average on 10 runs - GPU latency: 29.4449 ms - Host latency: 33.4857 ms (end to end 33.4996 ms, enqueue 30.9698 ms)
[11/05/2020-22:57:27] [I] Average on 10 runs - GPU latency: 29.0602 ms - Host latency: 33.111 ms (end to end 33.1245 ms, enqueue 30.5909 ms)
[11/05/2020-22:57:27] [I] Average on 10 runs - GPU latency: 28.9554 ms - Host latency: 32.9936 ms (end to end 33.0062 ms, enqueue 30.4774 ms)
[11/05/2020-22:57:27] [I] Average on 10 runs - GPU latency: 28.9516 ms - Host latency: 32.9896 ms (end to end 33.0044 ms, enqueue 30.4763 ms)
[11/05/2020-22:57:27] [I] Average on 10 runs - GPU latency: 28.872 ms - Host latency: 32.9149 ms (end to end 32.9287 ms, enqueue 30.3966 ms)
[11/05/2020-22:57:27] [I] Average on 10 runs - GPU latency: 28.8063 ms - Host latency: 32.852 ms (end to end 32.8636 ms, enqueue 30.3332 ms)
[11/05/2020-22:57:27] [I] Average on 10 runs - GPU latency: 29.2182 ms - Host latency: 33.2723 ms (end to end 33.2856 ms, enqueue 30.7474 ms)
[11/05/2020-22:57:27] [I] Average on 10 runs - GPU latency: 29.3361 ms - Host latency: 33.3466 ms (end to end 33.36 ms, enqueue 30.8641 ms)
[11/05/2020-22:57:27] [I] Host Latency
[11/05/2020-22:57:27] [I] min: 32.033 ms (end to end 32.0425 ms)
[11/05/2020-22:57:27] [I] max: 34.8295 ms (end to end 34.846 ms)
[11/05/2020-22:57:27] [I] mean: 33.0822 ms (end to end 33.0956 ms)
[11/05/2020-22:57:27] [I] median: 33.1332 ms (end to end 33.1468 ms)
[11/05/2020-22:57:27] [I] percentile: 34.8295 ms at 99% (end to end 34.846 ms at 99%)
[11/05/2020-22:57:27] [I] throughput: 0 qps
[11/05/2020-22:57:27] [I] walltime: 3.06719 s
[11/05/2020-22:57:27] [I] Enqueue Time
[11/05/2020-22:57:27] [I] min: 29.5044 ms
[11/05/2020-22:57:27] [I] max: 32.3288 ms
[11/05/2020-22:57:27] [I] median: 30.6182 ms
[11/05/2020-22:57:27] [I] GPU Compute
[11/05/2020-22:57:27] [I] min: 27.9836 ms
[11/05/2020-22:57:27] [I] max: 30.7969 ms
[11/05/2020-22:57:27] [I] mean: 29.0441 ms
[11/05/2020-22:57:27] [I] median: 29.0896 ms
[11/05/2020-22:57:27] [I] percentile: 30.7969 ms at 99%
[11/05/2020-22:57:27] [I] total compute time: 2.90441 s
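The median latency used for the fps calculation can also be pulled out of a saved trtexec log mechanically. A small sketch (the regex assumes the log format shown above, which may differ across trtexec versions):

```python
import re

def median_gpu_latency_ms(log_text):
    """Return the 'median: X ms' value from the 'GPU Compute' section
    of a trtexec log, or None if it is not found."""
    in_gpu_section = False
    for line in log_text.splitlines():
        if "GPU Compute" in line:
            in_gpu_section = True
        elif in_gpu_section:
            m = re.search(r"median:\s*([\d.]+)\s*ms", line)
            if m:
                return float(m.group(1))
    return None
```

Combined with the batch size, this reproduces the headline number: (1000 / 29.0896) * 8 ≈ 275 fps.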

Here’s my batch=8 performance result:

# /usr/src/tensorrt/bin/trtexec --onnx=yolov4_8_3_416_416_static.onnx --explicitBatch --saveEngine=yolov4-hat-8.engine --fp16 --batch=8 --workspace=2048 --dumpProfile

[11/06/2020-01:39:57] [W] [TRT] onnx2trt_utils.cpp:198: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[11/06/2020-01:40:07] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[11/06/2020-01:48:18] [I] [TRT] Detected 1 inputs and 8 output network tensors.
[11/06/2020-01:48:21] [W] [TRT] Current optimization profile is: 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[11/06/2020-01:48:24] [I] Warmup completed 0 queries over 200 ms
[11/06/2020-01:48:24] [I] Timing trace has 0 queries over 3.06346 s
[11/06/2020-01:48:24] [I] Trace averages of 10 runs:
[11/06/2020-01:48:24] [I] Average on 10 runs - GPU latency: 57.5682 ms - Host latency: 60.6255 ms (end to end 60.6377 ms)
[11/06/2020-01:48:24] [I] Average on 10 runs - GPU latency: 57.7061 ms - Host latency: 60.7643 ms (end to end 60.7749 ms)
[11/06/2020-01:48:24] [I] Average on 10 runs - GPU latency: 56.4765 ms - Host latency: 59.5387 ms (end to end 59.5506 ms)
[11/06/2020-01:48:24] [I] Average on 10 runs - GPU latency: 57.8121 ms - Host latency: 60.8705 ms (end to end 60.8827 ms)
[11/06/2020-01:48:24] [I] Average on 10 runs - GPU latency: 57.0815 ms - Host latency: 60.1399 ms (end to end 60.1529 ms)
[11/06/2020-01:48:24] [I] Host latency
[11/06/2020-01:48:24] [I] min: 58.0643 ms (end to end 58.0767 ms)
[11/06/2020-01:48:24] [I] max: 71.6186 ms (end to end 71.6322 ms)
[11/06/2020-01:48:24] [I] mean: 60.387 ms (end to end 60.399 ms)
[11/06/2020-01:48:24] [I] median: 60.1819 ms (end to end 60.1923 ms)
[11/06/2020-01:48:24] [I] percentile: 71.6186 ms at 99% (end to end 71.6322 ms at 99%)
[11/06/2020-01:48:24] [I] throughput: 0 qps
[11/06/2020-01:48:24] [I] walltime: 3.06346 s
[11/06/2020-01:48:24] [I] GPU Compute
[11/06/2020-01:48:24] [I] min: 55.0073 ms
[11/06/2020-01:48:24] [I] max: 68.5626 ms
[11/06/2020-01:48:24] [I] mean: 57.3292 ms
[11/06/2020-01:48:24] [I] median: 57.1242 ms
[11/06/2020-01:48:24] [I] percentile: 68.5626 ms at 99%
[11/06/2020-01:48:24] [I] total compute time: 2.92379 s

So the FPS is: (1000 / 57) * 8 ≈ 140.

I’m using the darknet version of the YOLOv4 weights to generate the ONNX, not the pytorch version.

python demo_darknet2onnx.py cfg/yolov4-hat.cfg yolov4-hat_7000.weights 233.png 8

Maybe the pytorch version implementation is faster?

By the way, I’m using an AWS g4dn.2xlarge instance, which has 1x T4, 8 CPU cores, and 32 GB of memory.
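To relate these numbers back to the 100+ stream question in the title: a rough capacity estimate divides the engine's total inference fps by the per-stream inference rate. The 3 fps-per-stream figure below is a hypothetical example (e.g. when DeepStream skips frames via its interval setting); actual capacity also depends on decode and tracking overhead:

```python
def estimated_streams(total_infer_fps, per_stream_fps):
    """Back-of-envelope: how many streams one GPU's inference budget covers.
    Ignores video decode, preprocessing, and tracker overhead."""
    return int(total_infer_fps // per_stream_fps)

# e.g. 140 total inference fps, inferring on 3 frames/sec per stream
print(estimated_streams(140, 3))  # -> 46
```

Under these (hypothetical) assumptions a single T4 falls short of 100 streams with this model, so reaching 100+ would need multiple GPUs, a lighter model, INT8, or a lower per-stream inference rate.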

Could you share this ONNX with us? Please give a link to download it.

Thanks!

Hi @CoderJustin,
I tested your ONNX; its perf is good on my machine:

TensorRT-7.2.1.6/bin$ ./trtexec --onnx=yolov4_8_3_416_416_static.onnx --batch=8 --explicitBatch --fp16 --workspace=2048
&&&& RUNNING TensorRT.trtexec # ./trtexec --onnx=yolov4_8_3_416_416_static.onnx --batch=8 --explicitBatch --fp16 --workspace=2048
[11/06/2020-17:40:50] [I] === Model Options ===
[11/06/2020-17:40:50] [I] Format: ONNX
[11/06/2020-17:40:50] [I] Model: yolov4_8_3_416_416_static.onnx
[11/06/2020-17:40:50] [I] Output:
[11/06/2020-17:40:50] [I] === Build Options ===
[11/06/2020-17:40:50] [I] Max batch: explicit
[11/06/2020-17:40:50] [I] Workspace: 2048 MiB
[11/06/2020-17:40:50] [I] minTiming: 1
[11/06/2020-17:40:50] [I] avgTiming: 8
[11/06/2020-17:40:50] [I] Precision: FP32+FP16
[11/06/2020-17:40:50] [I] Calibration:
[11/06/2020-17:40:50] [I] Refit: Disabled
[11/06/2020-17:40:50] [I] Safe mode: Disabled
[11/06/2020-17:40:50] [I] Save engine:
[11/06/2020-17:40:50] [I] Load engine:
[11/06/2020-17:40:50] [I] Builder Cache: Enabled
[11/06/2020-17:40:50] [I] NVTX verbosity: 0
[11/06/2020-17:40:50] [I] Tactic sources: Using default tactic sources
[11/06/2020-17:40:50] [I] Input(s)s format: fp32:CHW
[11/06/2020-17:40:50] [I] Output(s)s format: fp32:CHW
[11/06/2020-17:40:50] [I] Input build shapes: model
[11/06/2020-17:40:50] [I] Input calibration shapes: model
[11/06/2020-17:40:50] [I] === System Options ===
[11/06/2020-17:40:50] [I] Device: 0
[11/06/2020-17:40:50] [I] DLACore:
[11/06/2020-17:40:50] [I] Plugins:
[11/06/2020-17:40:50] [I] === Inference Options ===
[11/06/2020-17:40:50] [I] Batch: Explicit
[11/06/2020-17:40:50] [I] Input inference shapes: model
[11/06/2020-17:40:50] [I] Iterations: 10
[11/06/2020-17:40:50] [I] Duration: 3s (+ 200ms warm up)
[11/06/2020-17:40:50] [I] Sleep time: 0ms
[11/06/2020-17:40:50] [I] Streams: 1
[11/06/2020-17:40:50] [I] ExposeDMA: Disabled
[11/06/2020-17:40:50] [I] Data transfers: Enabled
[11/06/2020-17:40:50] [I] Spin-wait: Disabled
[11/06/2020-17:40:50] [I] Multithreading: Disabled
[11/06/2020-17:40:50] [I] CUDA Graph: Disabled
[11/06/2020-17:40:50] [I] Separate profiling: Disabled
[11/06/2020-17:40:50] [I] Skip inference: Disabled
[11/06/2020-17:40:50] [I] Inputs:
[11/06/2020-17:40:50] [I] === Reporting Options ===
[11/06/2020-17:40:50] [I] Verbose: Disabled
[11/06/2020-17:40:50] [I] Averages: 10 inferences
[11/06/2020-17:40:50] [I] Percentile: 99
[11/06/2020-17:40:50] [I] Dump refittable layers:Disabled
[11/06/2020-17:40:50] [I] Dump output: Disabled
[11/06/2020-17:40:50] [I] Profile: Disabled
[11/06/2020-17:40:50] [I] Export timing to JSON file:
[11/06/2020-17:40:50] [I] Export output to JSON file:
[11/06/2020-17:40:50] [I] Export profile to JSON file:
[11/06/2020-17:40:50] [I]
[11/06/2020-17:40:50] [I] === Device Information ===
[11/06/2020-17:40:50] [I] Selected Device: Tesla T4
[11/06/2020-17:40:50] [I] Compute Capability: 7.5
[11/06/2020-17:40:50] [I] SMs: 40
[11/06/2020-17:40:50] [I] Compute Clock Rate: 1.59 GHz
[11/06/2020-17:40:50] [I] Device Global Memory: 15109 MiB
[11/06/2020-17:40:50] [I] Shared Memory per SM: 64 KiB
[11/06/2020-17:40:50] [I] Memory Bus Width: 256 bits (ECC enabled)
[11/06/2020-17:40:50] [I] Memory Clock Rate: 5.001 GHz
[11/06/2020-17:40:50] [I]
----------------------------------------------------------------
Input filename:   yolov4_8_3_416_416_static.onnx
ONNX IR version:  0.0.4
Opset version:    11
Producer name:    pytorch
Producer version: 1.3
Domain:
Model version:    0
Doc string:
----------------------------------------------------------------
[11/06/2020-17:41:04] [W] [TRT] onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[11/06/2020-17:41:04] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[11/06/2020-17:41:04] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[11/06/2020-17:41:04] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[11/06/2020-17:41:04] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[11/06/2020-17:41:04] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[11/06/2020-17:41:04] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[11/06/2020-17:41:11] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.2.0 but loaded cuBLAS/cuBLAS LT 11.0.0
[11/06/2020-17:42:37] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[11/06/2020-17:50:06] [I] [TRT] Detected 1 inputs and 8 output network tensors.
[11/06/2020-17:50:07] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.2.0 but loaded cuBLAS/cuBLAS LT 11.0.0
[11/06/2020-17:50:07] [I] Engine built in 556.935 sec.
[11/06/2020-17:50:07] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.2.0 but loaded cuBLAS/cuBLAS LT 11.0.0
[11/06/2020-17:50:07] [I] Starting inference
[11/06/2020-17:50:10] [I] Warmup completed 0 queries over 200 ms
[11/06/2020-17:50:10] [I] Timing trace has 0 queries over 3.08235 s
[11/06/2020-17:50:10] [I] Trace averages of 10 runs:
[11/06/2020-17:50:10] [I] Average on 10 runs - GPU latency: 26.9358 ms - Host latency: 28.4944 ms (end to end 53.7046 ms, enqueue 4.77418 ms)
[11/06/2020-17:50:10] [I] Average on 10 runs - GPU latency: 27.2823 ms - Host latency: 28.8201 ms (end to end 54.3122 ms, enqueue 4.7356 ms)
[11/06/2020-17:50:10] [I] Average on 10 runs - GPU latency: 27.8538 ms - Host latency: 29.3901 ms (end to end 55.6408 ms, enqueue 4.73538 ms)
[11/06/2020-17:50:10] [I] Average on 10 runs - GPU latency: 26.9628 ms - Host latency: 28.5 ms (end to end 53.7159 ms, enqueue 4.74361 ms)
[11/06/2020-17:50:10] [I] Average on 10 runs - GPU latency: 27.2722 ms - Host latency: 28.8742 ms (end to end 54.3909 ms, enqueue 4.76619 ms)
[11/06/2020-17:50:10] [I] Average on 10 runs - GPU latency: 27.2568 ms - Host latency: 28.7947 ms (end to end 54.2373 ms, enqueue 4.73594 ms)
[11/06/2020-17:50:10] [I] Average on 10 runs - GPU latency: 27.3256 ms - Host latency: 28.8802 ms (end to end 54.5973 ms, enqueue 4.75266 ms)
[11/06/2020-17:50:10] [I] Average on 10 runs - GPU latency: 27.2072 ms - Host latency: 28.7929 ms (end to end 54.2088 ms, enqueue 4.75657 ms)
[11/06/2020-17:50:10] [I] Average on 10 runs - GPU latency: 27.2707 ms - Host latency: 28.9548 ms (end to end 54.4285 ms, enqueue 4.86018 ms)
[11/06/2020-17:50:10] [I] Average on 10 runs - GPU latency: 27.403 ms - Host latency: 29.1541 ms (end to end 54.6365 ms, enqueue 4.91628 ms)
[11/06/2020-17:50:10] [I] Average on 10 runs - GPU latency: 27.33 ms - Host latency: 29.0725 ms (end to end 54.492 ms, enqueue 4.90784 ms)
[11/06/2020-17:50:10] [I] Host Latency
[11/06/2020-17:50:10] [I] min: 28.1488 ms (end to end 53.1896 ms)
[11/06/2020-17:50:10] [I] max: 31.9742 ms (end to end 60.3521 ms)
[11/06/2020-17:50:10] [I] mean: 28.8869 ms (end to end 54.3971 ms)
[11/06/2020-17:50:10] [I] median: 28.8578 ms (end to end 54.3659 ms)
[11/06/2020-17:50:10] [I] percentile: 31.5651 ms at 99% (end to end 57.6644 ms at 99%)
[11/06/2020-17:50:10] [I] throughput: 0 qps
[11/06/2020-17:50:10] [I] walltime: 3.08235 s
[11/06/2020-17:50:10] [I] Enqueue Time
[11/06/2020-17:50:10] [I] min: 4.61621 ms
[11/06/2020-17:50:10] [I] max: 4.97241 ms
[11/06/2020-17:50:10] [I] median: 4.75208 ms
[11/06/2020-17:50:10] [I] GPU Compute
[11/06/2020-17:50:10] [I] min: 26.6096 ms
[11/06/2020-17:50:10] [I] max: 30.4394 ms
[11/06/2020-17:50:10] [I] mean: 27.2817 ms
[11/06/2020-17:50:10] [I] median: 27.223 ms
[11/06/2020-17:50:10] [I] percentile: 30.0277 ms at 99%
[11/06/2020-17:50:10] [I] total compute time: 3.05555 s
&&&& PASSED TensorRT.trtexec # ./trtexec --onnx=yolov4_8_3_416_416_static.onnx --batch=8 --explicitBatch --fp16 --workspace=2048

Yeah, maybe the AWS instance’s performance is not as good as yours.

What are the CUDA and TRT versions on your system?

Sorry for the late reply. CUDA is 10.2 and TRT is 7.0.0.