DeepStream 5.1, PyTorch, MobileNet SSD v1, retrained, ONNX - poor performance

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU)
Jetson Nano Dev Kit
• DeepStream Version
5.1
• JetPack Version (valid for Jetson only)
4.5
• TensorRT Version
• NVIDIA GPU Driver Version (valid for GPU only)

I followed a tutorial from @dusty_nv which showed how to do transfer learning with PyTorch. That worked absolutely fine, and I was able to downsize MobileNet v1 SSD from 90 classes to 8 (fruits only).

EDIT: The tutorial is here jetson-inference/pytorch-ssd.md at master · dusty-nv/jetson-inference · GitHub

While trying to run the resulting ONNX model with DeepStream I had to clear a couple of hurdles, namely:

  1. The model does not seem to be able to adapt to different PGIE batch sizes. Since I’m usually working with one to three USB cams, I’m constantly switching batch-size in the config file or setting it programmatically based on the number of inputs. This does not work, so one has to export the model from PyTorch to ONNX separately for every supported batch size (a minimal export sketch follows below this list).

  2. I had to create a working PGIE config myself. Below is a working example for batch-size=3 (i.e. three USB cameras):

    [property]
    workspace-size=600
    gpu-id=0
    net-scale-factor=0.003921569790691137
    onnx-file=/home/ubuntu/dragonfly-safety/jetson-inference/models/primary-detector-nano/ssd-mobilenet-b3.onnx
    labelfile-path=/home/ubuntu/dragonfly-safety/jetson-inference/models/primary-detector-nano/labels_onnx.txt
    batch-size=3
    model-color-format=0
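    # network-mode: 0 = FP32, 1 = INT8, 2 = FP16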
    network-mode=2
    num-detected-classes=4
    gie-unique-id=1
    output-blob-names=boxes;scores
    parse-bbox-func-name=NvDsInferParseCustomONNX
    custom-lib-path=/home/ubuntu/dragonfly-safety/jetson-inference/models/primary-detector-nano/libnvdsinfer_custom_impl_onnx.so
    
    [class-attrs-all]
    pre-cluster-threshold=0.5
    eps=0.2
    group-threshold=1
  3. Finally, a custom bbox parser library was needed, which I was able to stitch together from several NVIDIA samples and bring to life with the help of the famous and very helpful @dusty_nv. Thanks for this, Dusty.

EDIT: The lib is here. I published it for community use. GitHub - neilyoung/nvdsinfer_custom_impl_onnx
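
For reference, here is a minimal sketch of the per-batch-size export mentioned in point 1. It is a sketch only: `load_trained_ssd` is a placeholder for however the retrained checkpoint is built and loaded (e.g. with the tutorial's pytorch-ssd code), the input name `input_0` and the 300x300 resolution are assumptions, and the output names simply mirror output-blob-names in the config above.

    import torch

    def export_for_batch_size(model, batch_size, out_path, resolution=300):
        # Bake a fixed batch dimension into the graph: one ONNX file per batch size.
        model.eval()
        dummy = torch.randn(batch_size, 3, resolution, resolution)
        torch.onnx.export(
            model, dummy, out_path,
            input_names=["input_0"],            # assumed input name
            output_names=["scores", "boxes"],   # names as in output-blob-names above
            opset_version=9,                    # same opset as reported by trtexec below
            do_constant_folding=True,
        )

    # Placeholder: build/load the retrained MobileNet-SSD however your training setup does it.
    model = load_trained_ssd("models/fruit/mobilenet-v1-ssd.pth")
    for b in (1, 2, 3):
        export_for_batch_size(model, b, f"ssd-mobilenet-b{b}.onnx")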

So now it works. My first attempt was single-camera inference with the 8-class fruit model; that ran at about 29 fps. Then I went up and tested two cameras: 22 fps each. And finally three cameras: 15 fps per camera.

This was absolutely disappointing, since with other models, e.g. the resnet10 caffemodel or resnet34-peoplenet, I achieve the full 30 fps inference rate per camera with an input of 30 fps on each camera (MJPEG capture @ 640 x 480).

Until today I had hoped that further limiting the number of classes would bring the expected boost - but no… I trained a four-class model this morning and the inference rates are pretty much identical.

I ran /usr/src/tensorrt/bin/trtexec on all three models, and the poor performance was confirmed.

    ubuntu@jetson:~/jetson-inference/python/training/detection/ssd/models/fruit$ /usr/src/tensorrt/bin/trtexec --onnx=ssd-mobilenet-b1.onnx
    &&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=ssd-mobilenet-b1.onnx
    [04/16/2021-13:51:56] [I] === Model Options ===
    [04/16/2021-13:51:56] [I] Format: ONNX
    [04/16/2021-13:51:56] [I] Model: ssd-mobilenet-b1.onnx
    [04/16/2021-13:51:56] [I] Output:
    [04/16/2021-13:51:56] [I] === Build Options ===
    [04/16/2021-13:51:56] [I] Max batch: 1
    [04/16/2021-13:51:56] [I] Workspace: 16 MB
    [04/16/2021-13:51:56] [I] minTiming: 1
    [04/16/2021-13:51:56] [I] avgTiming: 8
    [04/16/2021-13:51:56] [I] Precision: FP32
    [04/16/2021-13:51:56] [I] Calibration: 
    [04/16/2021-13:51:56] [I] Safe mode: Disabled
    [04/16/2021-13:51:56] [I] Save engine: 
    [04/16/2021-13:51:56] [I] Load engine: 
    [04/16/2021-13:51:56] [I] Builder Cache: Enabled
    [04/16/2021-13:51:56] [I] NVTX verbosity: 0
    [04/16/2021-13:51:56] [I] Inputs format: fp32:CHW
    [04/16/2021-13:51:56] [I] Outputs format: fp32:CHW
    [04/16/2021-13:51:56] [I] Input build shapes: model
    [04/16/2021-13:51:56] [I] Input calibration shapes: model
    [04/16/2021-13:51:56] [I] === System Options ===
    [04/16/2021-13:51:56] [I] Device: 0
    [04/16/2021-13:51:56] [I] DLACore: 
    [04/16/2021-13:51:56] [I] Plugins:
    [04/16/2021-13:51:56] [I] === Inference Options ===
    [04/16/2021-13:51:56] [I] Batch: 1
    [04/16/2021-13:51:56] [I] Input inference shapes: model
    [04/16/2021-13:51:56] [I] Iterations: 10
    [04/16/2021-13:51:56] [I] Duration: 3s (+ 200ms warm up)
    [04/16/2021-13:51:56] [I] Sleep time: 0ms
    [04/16/2021-13:51:56] [I] Streams: 1
    [04/16/2021-13:51:56] [I] ExposeDMA: Disabled
    [04/16/2021-13:51:56] [I] Spin-wait: Disabled
    [04/16/2021-13:51:56] [I] Multithreading: Disabled
    [04/16/2021-13:51:56] [I] CUDA Graph: Disabled
    [04/16/2021-13:51:56] [I] Skip inference: Disabled
    [04/16/2021-13:51:56] [I] Inputs:
    [04/16/2021-13:51:56] [I] === Reporting Options ===
    [04/16/2021-13:51:56] [I] Verbose: Disabled
    [04/16/2021-13:51:56] [I] Averages: 10 inferences
    [04/16/2021-13:51:56] [I] Percentile: 99
    [04/16/2021-13:51:56] [I] Dump output: Disabled
    [04/16/2021-13:51:56] [I] Profile: Disabled
    [04/16/2021-13:51:56] [I] Export timing to JSON file: 
    [04/16/2021-13:51:56] [I] Export output to JSON file: 
    [04/16/2021-13:51:56] [I] Export profile to JSON file: 
    [04/16/2021-13:51:56] [I] 
    ----------------------------------------------------------------
    Input filename:   ssd-mobilenet-b1.onnx
    ONNX IR version:  0.0.6
    Opset version:    9
    Producer name:    pytorch
    Producer version: 1.6
    Domain:           
    Model version:    0
    Doc string:       
    ----------------------------------------------------------------
    [04/16/2021-13:52:00] [W] [TRT] onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
    [04/16/2021-13:52:00] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
    [04/16/2021-13:52:00] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
    [04/16/2021-13:52:00] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
    [04/16/2021-13:53:15] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
    [04/16/2021-13:54:51] [I] [TRT] Detected 1 inputs and 4 output network tensors.
    [04/16/2021-13:54:51] [I] Starting inference threads
    [04/16/2021-13:54:54] [I] Warmup completed 3 queries over 200 ms
    [04/16/2021-13:54:54] [I] Timing trace has 109 queries over 3.03307 s
    [04/16/2021-13:54:54] [I] Trace averages of 10 runs:
    [04/16/2021-13:54:54] [I] Average on 10 runs - GPU latency: 49.6921 ms - Host latency: 49.9041 ms (end to end 49.9221 ms, enqueue 4.66855 ms)
    [04/16/2021-13:54:54] [I] Average on 10 runs - GPU latency: 25.2711 ms - Host latency: 25.3944 ms (end to end 25.4074 ms, enqueue 7.01593 ms)
    [04/16/2021-13:54:54] [I] Average on 10 runs - GPU latency: 25.3403 ms - Host latency: 25.4637 ms (end to end 25.477 ms, enqueue 6.88536 ms)
    [04/16/2021-13:54:54] [I] Average on 10 runs - GPU latency: 25.4507 ms - Host latency: 25.5739 ms (end to end 25.5871 ms, enqueue 6.80699 ms)
    [04/16/2021-13:54:54] [I] Average on 10 runs - GPU latency: 25.553 ms - Host latency: 25.676 ms (end to end 25.689 ms, enqueue 7.06737 ms)
    [04/16/2021-13:54:54] [I] Average on 10 runs - GPU latency: 25.4548 ms - Host latency: 25.5781 ms (end to end 25.5908 ms, enqueue 6.52457 ms)
    [04/16/2021-13:54:54] [I] Average on 10 runs - GPU latency: 25.567 ms - Host latency: 25.6903 ms (end to end 25.7036 ms, enqueue 5.66252 ms)
    [04/16/2021-13:54:54] [I] Average on 10 runs - GPU latency: 25.4562 ms - Host latency: 25.5799 ms (end to end 25.5933 ms, enqueue 6.6405 ms)
    [04/16/2021-13:54:54] [I] Average on 10 runs - GPU latency: 25.4696 ms - Host latency: 25.5925 ms (end to end 25.6055 ms, enqueue 5.97766 ms)
    [04/16/2021-13:54:54] [I] Average on 10 runs - GPU latency: 25.5577 ms - Host latency: 25.6817 ms (end to end 25.6947 ms, enqueue 6.21155 ms)
    [04/16/2021-13:54:54] [I] Host Latency
    [04/16/2021-13:54:54] [I] min: 25.3134 ms (end to end 25.3267 ms)
    [04/16/2021-13:54:54] [I] max: 74.7488 ms (end to end 74.7719 ms)
    [04/16/2021-13:54:54] [I] mean: 27.8122 ms (end to end 27.8257 ms)
    [04/16/2021-13:54:54] [I] median: 25.5723 ms (end to end 25.5858 ms)
    [04/16/2021-13:54:54] [I] percentile: 71.9283 ms at 99% (end to end 71.9514 ms at 99%)
    [04/16/2021-13:54:54] [I] throughput: 35.9372 qps
    [04/16/2021-13:54:54] [I] walltime: 3.03307 s
    [04/16/2021-13:54:54] [I] Enqueue Time
    [04/16/2021-13:54:54] [I] min: 3.26306 ms
    [04/16/2021-13:54:54] [I] max: 9.04639 ms
    [04/16/2021-13:54:54] [I] median: 6.63379 ms
    [04/16/2021-13:54:54] [I] GPU Compute
    [04/16/2021-13:54:54] [I] min: 25.1902 ms
    [04/16/2021-13:54:54] [I] max: 74.4601 ms
    [04/16/2021-13:54:54] [I] mean: 27.6807 ms
    [04/16/2021-13:54:54] [I] median: 25.4481 ms
    [04/16/2021-13:54:54] [I] percentile: 71.6376 ms at 99%
    [04/16/2021-13:54:54] [I] total compute time: 3.0172 s
    &&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=ssd-mobilenet-b1.onnx
    ubuntu@jetson:~/jetson-inference/python/training/detection/ssd/models/fruit$ /usr/src/tensorrt/bin/trtexec --onnx=ssd-mobilenet-b2.onnx
    &&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=ssd-mobilenet-b2.onnx
    [04/16/2021-13:55:49] [I] === Model Options ===
    [04/16/2021-13:55:49] [I] Format: ONNX
    [04/16/2021-13:55:49] [I] Model: ssd-mobilenet-b2.onnx
    [04/16/2021-13:55:49] [I] Output:
    [04/16/2021-13:55:49] [I] === Build Options ===
    [04/16/2021-13:55:49] [I] Max batch: 1
    [04/16/2021-13:55:49] [I] Workspace: 16 MB
    [04/16/2021-13:55:49] [I] minTiming: 1
    [04/16/2021-13:55:49] [I] avgTiming: 8
    [04/16/2021-13:55:49] [I] Precision: FP32
    [04/16/2021-13:55:49] [I] Calibration: 
    [04/16/2021-13:55:49] [I] Safe mode: Disabled
    [04/16/2021-13:55:49] [I] Save engine: 
    [04/16/2021-13:55:49] [I] Load engine: 
    [04/16/2021-13:55:49] [I] Builder Cache: Enabled
    [04/16/2021-13:55:49] [I] NVTX verbosity: 0
    [04/16/2021-13:55:49] [I] Inputs format: fp32:CHW
    [04/16/2021-13:55:49] [I] Outputs format: fp32:CHW
    [04/16/2021-13:55:49] [I] Input build shapes: model
    [04/16/2021-13:55:49] [I] Input calibration shapes: model
    [04/16/2021-13:55:49] [I] === System Options ===
    [04/16/2021-13:55:49] [I] Device: 0
    [04/16/2021-13:55:49] [I] DLACore: 
    [04/16/2021-13:55:49] [I] Plugins:
    [04/16/2021-13:55:49] [I] === Inference Options ===
    [04/16/2021-13:55:49] [I] Batch: 1
    [04/16/2021-13:55:49] [I] Input inference shapes: model
    [04/16/2021-13:55:49] [I] Iterations: 10
    [04/16/2021-13:55:49] [I] Duration: 3s (+ 200ms warm up)
    [04/16/2021-13:55:49] [I] Sleep time: 0ms
    [04/16/2021-13:55:49] [I] Streams: 1
    [04/16/2021-13:55:49] [I] ExposeDMA: Disabled
    [04/16/2021-13:55:49] [I] Spin-wait: Disabled
    [04/16/2021-13:55:49] [I] Multithreading: Disabled
    [04/16/2021-13:55:49] [I] CUDA Graph: Disabled
    [04/16/2021-13:55:49] [I] Skip inference: Disabled
    [04/16/2021-13:55:49] [I] Inputs:
    [04/16/2021-13:55:49] [I] === Reporting Options ===
    [04/16/2021-13:55:49] [I] Verbose: Disabled
    [04/16/2021-13:55:49] [I] Averages: 10 inferences
    [04/16/2021-13:55:49] [I] Percentile: 99
    [04/16/2021-13:55:49] [I] Dump output: Disabled
    [04/16/2021-13:55:49] [I] Profile: Disabled
    [04/16/2021-13:55:49] [I] Export timing to JSON file: 
    [04/16/2021-13:55:49] [I] Export output to JSON file: 
    [04/16/2021-13:55:49] [I] Export profile to JSON file: 
    [04/16/2021-13:55:49] [I] 
    ----------------------------------------------------------------
    Input filename:   ssd-mobilenet-b2.onnx
    ONNX IR version:  0.0.6
    Opset version:    9
    Producer name:    pytorch
    Producer version: 1.6
    Domain:           
    Model version:    0
    Doc string:       
    ----------------------------------------------------------------
    [04/16/2021-13:55:51] [W] [TRT] onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
    [04/16/2021-13:55:51] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
    [04/16/2021-13:55:51] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
    [04/16/2021-13:55:51] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
    [04/16/2021-13:57:30] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
    [04/16/2021-13:59:33] [I] [TRT] Detected 1 inputs and 4 output network tensors.
    [04/16/2021-13:59:33] [I] Starting inference threads
    [04/16/2021-13:59:37] [I] Warmup completed 2 queries over 200 ms
    [04/16/2021-13:59:37] [I] Timing trace has 62 queries over 3.07764 s
    [04/16/2021-13:59:37] [I] Trace averages of 10 runs:
    [04/16/2021-13:59:37] [I] Average on 10 runs - GPU latency: 53.5355 ms - Host latency: 53.7964 ms (end to end 53.81 ms, enqueue 4.40636 ms)
    [04/16/2021-13:59:37] [I] Average on 10 runs - GPU latency: 48.614 ms - Host latency: 48.8524 ms (end to end 48.8655 ms, enqueue 4.14586 ms)
    [04/16/2021-13:59:37] [I] Average on 10 runs - GPU latency: 48.6112 ms - Host latency: 48.8508 ms (end to end 48.8639 ms, enqueue 4.29144 ms)
    [04/16/2021-13:59:37] [I] Average on 10 runs - GPU latency: 48.575 ms - Host latency: 48.8144 ms (end to end 48.8277 ms, enqueue 8.96932 ms)
    [04/16/2021-13:59:37] [I] Average on 10 runs - GPU latency: 48.5648 ms - Host latency: 48.804 ms (end to end 48.817 ms, enqueue 7.49756 ms)
    [04/16/2021-13:59:37] [I] Average on 10 runs - GPU latency: 48.5517 ms - Host latency: 48.7905 ms (end to end 48.8037 ms, enqueue 9.36379 ms)
    [04/16/2021-13:59:37] [I] Host Latency
    [04/16/2021-13:59:37] [I] min: 48.7 ms (end to end 48.7134 ms)
    [04/16/2021-13:59:37] [I] max: 87.0511 ms (end to end 87.0677 ms)
    [04/16/2021-13:59:37] [I] mean: 49.6255 ms (end to end 49.6387 ms)
    [04/16/2021-13:59:37] [I] median: 48.8181 ms (end to end 48.8309 ms)
    [04/16/2021-13:59:37] [I] percentile: 87.0511 ms at 99% (end to end 87.0677 ms at 99%)
    [04/16/2021-13:59:37] [I] throughput: 20.1453 qps
    [04/16/2021-13:59:37] [I] walltime: 3.07764 s
    [04/16/2021-13:59:37] [I] Enqueue Time
    [04/16/2021-13:59:37] [I] min: 3.65649 ms
    [04/16/2021-13:59:37] [I] max: 11.1689 ms
    [04/16/2021-13:59:37] [I] median: 4.8799 ms
    [04/16/2021-13:59:37] [I] GPU Compute
    [04/16/2021-13:59:37] [I] min: 48.4604 ms
    [04/16/2021-13:59:37] [I] max: 86.6143 ms
    [04/16/2021-13:59:37] [I] mean: 49.3829 ms
    [04/16/2021-13:59:37] [I] median: 48.5784 ms
    [04/16/2021-13:59:37] [I] percentile: 86.6143 ms at 99%
    [04/16/2021-13:59:37] [I] total compute time: 3.06174 s
    &&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=ssd-mobilenet-b2.onnx
    ubuntu@jetson:~/jetson-inference/python/training/detection/ssd/models/fruit$ /usr/src/tensorrt/bin/trtexec --onnx=ssd-mobilenet-b3.onnx
    &&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=ssd-mobilenet-b3.onnx
    [04/16/2021-14:00:12] [I] === Model Options ===
    [04/16/2021-14:00:12] [I] Format: ONNX
    [04/16/2021-14:00:12] [I] Model: ssd-mobilenet-b3.onnx
    [04/16/2021-14:00:12] [I] Output:
    [04/16/2021-14:00:12] [I] === Build Options ===
    [04/16/2021-14:00:12] [I] Max batch: 1
    [04/16/2021-14:00:12] [I] Workspace: 16 MB
    [04/16/2021-14:00:12] [I] minTiming: 1
    [04/16/2021-14:00:12] [I] avgTiming: 8
    [04/16/2021-14:00:12] [I] Precision: FP32
    [04/16/2021-14:00:12] [I] Calibration: 
    [04/16/2021-14:00:12] [I] Safe mode: Disabled
    [04/16/2021-14:00:12] [I] Save engine: 
    [04/16/2021-14:00:12] [I] Load engine: 
    [04/16/2021-14:00:12] [I] Builder Cache: Enabled
    [04/16/2021-14:00:12] [I] NVTX verbosity: 0
    [04/16/2021-14:00:12] [I] Inputs format: fp32:CHW
    [04/16/2021-14:00:12] [I] Outputs format: fp32:CHW
    [04/16/2021-14:00:12] [I] Input build shapes: model
    [04/16/2021-14:00:12] [I] Input calibration shapes: model
    [04/16/2021-14:00:12] [I] === System Options ===
    [04/16/2021-14:00:12] [I] Device: 0
    [04/16/2021-14:00:12] [I] DLACore: 
    [04/16/2021-14:00:12] [I] Plugins:
    [04/16/2021-14:00:12] [I] === Inference Options ===
    [04/16/2021-14:00:12] [I] Batch: 1
    [04/16/2021-14:00:12] [I] Input inference shapes: model
    [04/16/2021-14:00:12] [I] Iterations: 10
    [04/16/2021-14:00:12] [I] Duration: 3s (+ 200ms warm up)
    [04/16/2021-14:00:12] [I] Sleep time: 0ms
    [04/16/2021-14:00:12] [I] Streams: 1
    [04/16/2021-14:00:12] [I] ExposeDMA: Disabled
    [04/16/2021-14:00:12] [I] Spin-wait: Disabled
    [04/16/2021-14:00:12] [I] Multithreading: Disabled
    [04/16/2021-14:00:12] [I] CUDA Graph: Disabled
    [04/16/2021-14:00:12] [I] Skip inference: Disabled
    [04/16/2021-14:00:12] [I] Inputs:
    [04/16/2021-14:00:12] [I] === Reporting Options ===
    [04/16/2021-14:00:12] [I] Verbose: Disabled
    [04/16/2021-14:00:12] [I] Averages: 10 inferences
    [04/16/2021-14:00:12] [I] Percentile: 99
    [04/16/2021-14:00:12] [I] Dump output: Disabled
    [04/16/2021-14:00:12] [I] Profile: Disabled
    [04/16/2021-14:00:12] [I] Export timing to JSON file: 
    [04/16/2021-14:00:12] [I] Export output to JSON file: 
    [04/16/2021-14:00:12] [I] Export profile to JSON file: 
    [04/16/2021-14:00:12] [I] 
    ----------------------------------------------------------------
    Input filename:   ssd-mobilenet-b3.onnx
    ONNX IR version:  0.0.6
    Opset version:    9
    Producer name:    pytorch
    Producer version: 1.6
    Domain:           
    Model version:    0
    Doc string:       
    ----------------------------------------------------------------
    [04/16/2021-14:00:14] [W] [TRT] onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
    [04/16/2021-14:00:14] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
    [04/16/2021-14:00:14] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
    [04/16/2021-14:00:14] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
    [04/16/2021-14:01:47] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
    [04/16/2021-14:04:04] [I] [TRT] Detected 1 inputs and 4 output network tensors.
    [04/16/2021-14:04:04] [I] Starting inference threads
    [04/16/2021-14:04:07] [I] Warmup completed 3 queries over 200 ms
    [04/16/2021-14:04:07] [I] Timing trace has 43 queries over 3.12521 s
    [04/16/2021-14:04:07] [I] Trace averages of 10 runs:
    [04/16/2021-14:04:07] [I] Average on 10 runs - GPU latency: 71.8557 ms - Host latency: 72.2121 ms (end to end 72.2254 ms, enqueue 3.636 ms)
    [04/16/2021-14:04:07] [I] Average on 10 runs - GPU latency: 73.7693 ms - Host latency: 74.1263 ms (end to end 74.1396 ms, enqueue 3.61756 ms)
    [04/16/2021-14:04:07] [I] Average on 10 runs - GPU latency: 71.8727 ms - Host latency: 72.227 ms (end to end 72.2399 ms, enqueue 3.64722 ms)
    [04/16/2021-14:04:07] [I] Average on 10 runs - GPU latency: 71.8787 ms - Host latency: 72.2331 ms (end to end 72.2455 ms, enqueue 3.5853 ms)
    [04/16/2021-14:04:07] [I] Host Latency
    [04/16/2021-14:04:07] [I] min: 72.0574 ms (end to end 72.0706 ms)
    [04/16/2021-14:04:07] [I] max: 82.2065 ms (end to end 82.2222 ms)
    [04/16/2021-14:04:07] [I] mean: 72.6658 ms (end to end 72.6786 ms)
    [04/16/2021-14:04:07] [I] median: 72.2212 ms (end to end 72.2349 ms)
    [04/16/2021-14:04:07] [I] percentile: 82.2065 ms at 99% (end to end 82.2222 ms at 99%)
    [04/16/2021-14:04:07] [I] throughput: 13.7591 qps
    [04/16/2021-14:04:07] [I] walltime: 3.12521 s
    [04/16/2021-14:04:07] [I] Enqueue Time
    [04/16/2021-14:04:07] [I] min: 3.4043 ms
    [04/16/2021-14:04:07] [I] max: 3.91513 ms
    [04/16/2021-14:04:07] [I] median: 3.60901 ms
    [04/16/2021-14:04:07] [I] GPU Compute
    [04/16/2021-14:04:07] [I] min: 71.7029 ms
    [04/16/2021-14:04:07] [I] max: 81.8503 ms
    [04/16/2021-14:04:07] [I] mean: 72.3103 ms
    [04/16/2021-14:04:07] [I] median: 71.8657 ms
    [04/16/2021-14:04:07] [I] percentile: 81.8503 ms at 99%
    [04/16/2021-14:04:07] [I] total compute time: 3.10934 s
    &&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=ssd-mobilenet-b3.onnx
    ubuntu@jetson:~/jetson-inference/python/training/detection/ssd/models/fruit$ 

About 25 ms, 49 ms and 72 ms. As far as I understand, this is the end-to-end time per inference (?)

At least it would confirm my own practical experience with the cameras.

The question is - apart from using TensorRT directly: is there any potential to boost these results?

Hi,

You can also get better results by running TensorRT in FP16 mode.
Moreover, since DeepStream supports trackers, you can leverage one by applying inference with interval=2.

Thanks.

I thought I was already using FP16.

interval=2 doesn’t improve things. I mean, how could it? The mean time per inference is about 60 ms, which IMHO works out to roughly 15 fps.

Hi,

Based on the benchmark results above, we should be able to reach around 43 fps for SSD Mobilenet-V1.
Not sure if you have already done this, but you can boost the Nano into performance mode with the following commands:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Moreover, setting interval lets you leverage the functionality of the tracker.
In most cases, you don’t need to apply detection to every single frame.
Instead, you can run it periodically and predict the intermediate results with a lightweight tracker.
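
For illustration only, interval is set in the same nvinfer [property] group as the rest of the PGIE config; a hypothetical excerpt (the value 2 is just an example):

    [property]
    # ... existing properties ...
    # skip 2 consecutive batches between inferences; a downstream tracker
    # (e.g. nvtracker) bridges the skipped frames
    interval=2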

Thanks.

Yes, I already have jetson_clocks and all that in place. I don’t see how “interval” (set to what value?) could help.

Hi,

You can find more details in our document below:
https://docs.nvidia.com/metropolis/deepstream/dev-guide/text/DS_plugin_gst-nvinfer.html

It mainly indicates the number of consecutive batches to be skipped for inference.
Thanks.

I set interval=2 - wouldn’t that mean that it skips every second frame? Is that parameter only recognized in conjunction with nvtracker? (I don’t use nvtracker.)

Hi,

The intermediate values can be predicted by our tracker, which is much more lightweight.

Is this acceptable?

Thanks.