When to use NvDsPreprocess with multiple ROIs?

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU) GPU
• DeepStream Version 6.2
• TensorRT Version 8.5.2.2
• NVIDIA GPU Driver Version (valid for GPU only) 525.125.06
• Issue Type (questions, new requirements, bugs) Question
I would like NVIDIA's opinion on optimization and NvDsPreprocess plugin usage:

  1. I have an area of 3840x962 which I resize to 640x640 and feed to yoloV7 for primary inference. YoloV7 batch-size=1
  2. I split the same area into 3 ROIs of roughly the same size and feed them to the yoloV7 model with batch-size=3. I use the NvDsPreprocess plugin for this (see the pipeline sketch after this list).
  3. I use 3 sources (video inputs) with custom-written code that cuts out the provided ROI and feeds it to streammux down the pipeline. YoloV7 batch-size=3
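
For context, a minimal sketch of what the second setup's pipeline looks like on my side; the file path and property values are illustrative, and input-tensor-meta assumes nvinfer is configured to consume the tensors prepared by NvDsPreprocess:

gst-launch-1.0 uridecodebin uri=file:///path/to/video.mp4 ! m.sink_0 \
  nvstreammux name=m batch-size=1 width=3840 height=962 batched-push-timeout=40000 ! \
  nvdspreprocess config-file=config_preprocess.txt ! \
  nvinfer config-file-path=config_infer_primary_yoloV7.txt input-tensor-meta=true ! \
  nvvideoconvert ! nvdsosd ! fakesink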

I ran a small experiment to compare the max FPS in DeepStream:

  1. 1st setup - AVG 17 FPS
  2. 2nd setup - AVG 10 FPS
  3. 3rd setup - AVG 15 FPS

Note: overall, the FPS may seem very low; that is because I use an additional custom plugin that saves full frames to Redis.
Question 1: why did using the NvDsPreprocess plugin give the worst performance? With the 1st setup, batch-size=1, but the frame itself is large. With the 2nd (and 3rd) setups, batch-size=3, but the frames are much smaller due to the 3 ROIs. Is feeding multiple frames to yoloV7 for inference so much more costly that it outweighs the benefit of smaller frames?
Question 2: would you recommend feeding narrow frames (like 3840x962) to an object detection model, or square-shaped frames, as with the 3 ROIs? This is why I tried the 3 ROIs with NvDsPreprocess, but the performance drop was a bit too much.

1. FPS represents the performance of the whole pipeline. Please use this method to measure the latency of the pipeline components.
2. NvDsPreprocess and nvinfer support scaling with padding; please see maintain-aspect-ratio and symmetric-padding in the nvinfer doc and the nvdspreprocess doc.
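
For illustration, a minimal sketch of the relevant keys in an nvinfer configuration; the key names are the documented ones, and the surrounding file is assumed to look like config_infer_primary_yoloV7.txt:

[property]
# Keep the source aspect ratio when scaling to the 640x640 network input
maintain-aspect-ratio=1
# Pad equally on both sides instead of only right/bottom
symmetric-padding=1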

In general, when should I use the ROI functionality rather than just inferring on the full frame?

As you know, ROI means region of interest. It depends on the requirements; sometimes users only want to run inference on the ROIs. Please refer to the nvdspreprocess doc; a sketch of an ROI configuration follows.
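
A minimal sketch of how ROIs are declared for NvDsPreprocess, following the config_preprocess.txt conventions of the deepstream-preprocess-test sample; the coordinates below are illustrative values for three 1280x962 tiles:

[property]
enable=1
# batch;channels;height;width of the tensor handed to the PGIE
network-input-shape=3;3;640;640

[group-0]
src-ids=0
process-on-roi=1
# left;top;width;height per ROI, concatenated
roi-params-src-0=0;0;1280;962;1280;0;1280;962;2560;0;1280;962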

I tested the latency of the pipeline and its plugins:

  • NvDsPreprocess disabled, streammux 3840x962
    - Nvinfer plugin latency 29-107ms
    - Frame latency 279-336ms

  • NvDsPreprocess enabled, 3 ROIs of size 1280x962, batch-size=3
    - NvDsPreprocess plugin latency 9-20ms
    - Nvinfer plugin latency 188-201ms
    - Frame latency 680-780ms

  • NvDsPreprocess enabled, 1 ROI of size 3840x962
    - NvDsPreprocess plugin latency 1-3ms
    - Nvinfer plugin latency 70-90ms
    - Frame latency 300-330ms

  1. NvDsPreprocess plugin latency is minimal, so does that mean that resizing images for inference uses minimal resources?
  2. I thought that using NvDsPreprocess would increase pipeline speed, because no streammux reshape is done and the PGIE model receives prepared tensors. But as you can see, 3 ROIs gives the worst performance. Is it purely because 3 images are passed to the PGIE instead of 1, while all the resizing/preprocessing in the pipeline is trivial?
  3. nvv4l2decoder0 plugin latency increases when using multiple ROIs. Why?

Here are some questions to analyze:
Is the source type a local file or RTSP? We suggest using a local file when testing performance.
Where did you get the yolov7 model? Could you elaborate on “with custom-written code that cuts out provided ROI”?
Can you share the three whole media pipelines? To narrow down the issue, you can use fakesink.

  1. The plugin uses GPU acceleration for resizing.
  2. From the data, nvinfer takes more time with 3 ROIs. Please narrow this issue down as follows:
    a> The nvinfer plugin and its low-level lib are open source; please check whether postprocessing costs much time. Refer to InferPostprocessor::postProcessHost in nvdsinfer_context_impl.cpp. Did you use acceleration in the box-parsing function? Please refer to this code.
  3. Could you share the test data? Can you use the DeepStream sample deepstream-preprocess-test to reproduce this issue (see the example invocation below)? nvv4l2decoder0 may wait if downstream does not return buffers.
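
For reference, an example invocation of that sample, assuming its documented usage of a preprocess config, an infer config, and one or more source URIs:

cd /opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/deepstream-preprocess-test
./deepstream-preprocess-test config_preprocess.txt config_infer.txt \
  file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_1080p_h264.mp4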

I did use a local file for testing.
I use a custom-trained yolov7 from this repo. I export the trained pth model to onnx using this repo. I also use the nvinfer custom lib from the same repo. Although I used the smallest weights (yolov7.pt) for initial training, the trained model weights file (converted to onnx) is around 140MB, and converted to engine around 90MB, so it makes me think it's quite a heavyweight model.

To address your points:
2. I am not sure how I should measure the postprocess cost in InferPostprocessor::postProcessHost. Regarding the accelerated bbox parsing, as I mentioned I use the custom nvinfer lib from here, and it also provides a CUDA bbox-parsing function, but it gives no major performance improvement. I will try the NVIDIA-AI-IOT/yolo_deepstream version you suggested.
3. I used deepstream-preprocess-test with my yolov7 model and the video provided by DeepStream (sample_1080p_h264.mp4). When I have 1 ROI of 3840x962, the video is processed quite fast. However, if I use 3 ROIs of 1280x962, I get the following:
[screenshot of the sample's output]
The output video has really low FPS.
Do you think the issue is with my yolov7 model? When I run the sample app with the default resnet10.caffemodel, the performance is just fine.
Also, I enabled the envs for latency measurement in deepstream-test-app, but there is no latency output and no FPS output. Is there a way I could see the latency and FPS? (For reference, the variables I set are shown below.)
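
These are the standard DeepStream latency-measurement environment variables I am referring to, set before launching the app (they only produce output if the app includes the latency-measurement probe, as deepstream-app does):

export NVDS_ENABLE_LATENCY_MEASUREMENT=1
# per-component (plugin) latency in addition to per-frame latency
export NVDS_ENABLE_COMPONENT_LATENCY_MEASUREMENT=1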

I also ran trtexec --loadEngine=model_b1_gpu0_fp16.engine on my yolov7 model.
In the output, it specifies that the model precision is FP32, although I generated the engine with network-mode=2 (fp16). Perhaps this is the cause of the model being so heavy?
Here is the full log:

&&& RUNNING TensorRT.trtexec [TensorRT v8502] # trtexec --loadEngine=model_b1_gpu0_fp16.engine
[09/29/2023-08:14:46] [I] === Model Options ===
[09/29/2023-08:14:46] [I] Format: *
[09/29/2023-08:14:46] [I] Model: 
[09/29/2023-08:14:46] [I] Output:
[09/29/2023-08:14:46] [I] === Build Options ===
[09/29/2023-08:14:46] [I] Max batch: 1
[09/29/2023-08:14:46] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[09/29/2023-08:14:46] [I] minTiming: 1
[09/29/2023-08:14:46] [I] avgTiming: 8
[09/29/2023-08:14:46] [I] Precision: FP32
[09/29/2023-08:14:46] [I] LayerPrecisions: 
[09/29/2023-08:14:46] [I] Calibration: 
[09/29/2023-08:14:46] [I] Refit: Disabled
[09/29/2023-08:14:46] [I] Sparsity: Disabled
[09/29/2023-08:14:46] [I] Safe mode: Disabled
[09/29/2023-08:14:46] [I] DirectIO mode: Disabled
[09/29/2023-08:14:46] [I] Restricted mode: Disabled
[09/29/2023-08:14:46] [I] Build only: Disabled
[09/29/2023-08:14:46] [I] Save engine: 
[09/29/2023-08:14:46] [I] Load engine: model_b1_gpu0_fp16.engine
[09/29/2023-08:14:46] [I] Profiling verbosity: 0
[09/29/2023-08:14:46] [I] Tactic sources: Using default tactic sources
[09/29/2023-08:14:46] [I] timingCacheMode: local
[09/29/2023-08:14:46] [I] timingCacheFile: 
[09/29/2023-08:14:46] [I] Heuristic: Disabled
[09/29/2023-08:14:46] [I] Preview Features: Use default preview flags.
[09/29/2023-08:14:46] [I] Input(s)s format: fp32:CHW
[09/29/2023-08:14:46] [I] Output(s)s format: fp32:CHW
[09/29/2023-08:14:46] [I] Input build shapes: model
[09/29/2023-08:14:46] [I] Input calibration shapes: model
[09/29/2023-08:14:46] [I] === System Options ===
[09/29/2023-08:14:46] [I] Device: 0
[09/29/2023-08:14:46] [I] DLACore: 
[09/29/2023-08:14:46] [I] Plugins:
[09/29/2023-08:14:46] [I] === Inference Options ===
[09/29/2023-08:14:46] [I] Batch: 1
[09/29/2023-08:14:46] [I] Input inference shapes: model
[09/29/2023-08:14:46] [I] Iterations: 10
[09/29/2023-08:14:46] [I] Duration: 3s (+ 200ms warm up)
[09/29/2023-08:14:46] [I] Sleep time: 0ms
[09/29/2023-08:14:46] [I] Idle time: 0ms
[09/29/2023-08:14:46] [I] Streams: 1
[09/29/2023-08:14:46] [I] ExposeDMA: Disabled
[09/29/2023-08:14:46] [I] Data transfers: Enabled
[09/29/2023-08:14:46] [I] Spin-wait: Disabled
[09/29/2023-08:14:46] [I] Multithreading: Disabled
[09/29/2023-08:14:46] [I] CUDA Graph: Disabled
[09/29/2023-08:14:46] [I] Separate profiling: Disabled
[09/29/2023-08:14:46] [I] Time Deserialize: Disabled
[09/29/2023-08:14:46] [I] Time Refit: Disabled
[09/29/2023-08:14:46] [I] NVTX verbosity: 0
[09/29/2023-08:14:46] [I] Persistent Cache Ratio: 0
[09/29/2023-08:14:46] [I] Inputs:
[09/29/2023-08:14:46] [I] === Reporting Options ===
[09/29/2023-08:14:46] [I] Verbose: Disabled
[09/29/2023-08:14:46] [I] Averages: 10 inferences
[09/29/2023-08:14:46] [I] Percentiles: 90,95,99
[09/29/2023-08:14:46] [I] Dump refittable layers:Disabled
[09/29/2023-08:14:46] [I] Dump output: Disabled
[09/29/2023-08:14:46] [I] Profile: Disabled
[09/29/2023-08:14:46] [I] Export timing to JSON file: 
[09/29/2023-08:14:46] [I] Export output to JSON file: 
[09/29/2023-08:14:46] [I] Export profile to JSON file: 
[09/29/2023-08:14:46] [I] 
[09/29/2023-08:14:46] [I] === Device Information ===
[09/29/2023-08:14:46] [I] Selected Device: NVIDIA GeForce GTX 1650
[09/29/2023-08:14:46] [I] Compute Capability: 7.5
[09/29/2023-08:14:46] [I] SMs: 16
[09/29/2023-08:14:46] [I] Compute Clock Rate: 1.56 GHz
[09/29/2023-08:14:46] [I] Device Global Memory: 3903 MiB
[09/29/2023-08:14:46] [I] Shared Memory per SM: 64 KiB
[09/29/2023-08:14:46] [I] Memory Bus Width: 128 bits (ECC disabled)
[09/29/2023-08:14:46] [I] Memory Clock Rate: 4.001 GHz
[09/29/2023-08:14:46] [I] 
[09/29/2023-08:14:46] [I] TensorRT version: 8.5.2
[09/29/2023-08:14:46] [I] Engine loaded in 0.0748545 sec.
[09/29/2023-08:14:47] [I] [TRT] Loaded engine size: 78 MiB
[09/29/2023-08:14:47] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +82, now: CPU 0, GPU 82 (MiB)
[09/29/2023-08:14:47] [I] Engine deserialized in 0.902906 sec.
[09/29/2023-08:14:47] [W] [TRT] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See `CUDA_MODULE_LOADING` in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
[09/29/2023-08:14:47] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +62, now: CPU 0, GPU 144 (MiB)
[09/29/2023-08:14:47] [I] Setting persistentCacheLimit to 0 bytes.
[09/29/2023-08:14:47] [I] Using random values for input input
[09/29/2023-08:14:47] [I] Created input binding for input with dimensions 1x3x640x640
[09/29/2023-08:14:47] [I] Using random values for output boxes
[09/29/2023-08:14:47] [I] Created output binding for boxes with dimensions 1x25200x4
[09/29/2023-08:14:47] [I] Using random values for output scores
[09/29/2023-08:14:47] [I] Created output binding for scores with dimensions 1x25200x1
[09/29/2023-08:14:47] [I] Using random values for output classes
[09/29/2023-08:14:47] [I] Created output binding for classes with dimensions 1x25200x1
[09/29/2023-08:14:47] [I] Starting inference
[09/29/2023-08:14:50] [I] Warmup completed 5 queries over 200 ms
[09/29/2023-08:14:50] [I] Timing trace has 74 queries over 3.12276 s
[09/29/2023-08:14:50] [I] 
[09/29/2023-08:14:50] [I] === Trace details ===
[09/29/2023-08:14:50] [I] Trace averages of 10 runs:
[09/29/2023-08:14:50] [I] Average on 10 runs - GPU latency: 41.4413 ms - Host latency: 41.918 ms (enqueue 1.55435 ms)
[09/29/2023-08:14:50] [I] Average on 10 runs - GPU latency: 42.887 ms - Host latency: 43.3493 ms (enqueue 0.666095 ms)
[09/29/2023-08:14:50] [I] Average on 10 runs - GPU latency: 42.7747 ms - Host latency: 43.2411 ms (enqueue 0.729218 ms)
[09/29/2023-08:14:50] [I] Average on 10 runs - GPU latency: 41.8043 ms - Host latency: 42.2783 ms (enqueue 1.41774 ms)
[09/29/2023-08:14:50] [I] Average on 10 runs - GPU latency: 41.2208 ms - Host latency: 41.7014 ms (enqueue 1.96719 ms)
[09/29/2023-08:14:50] [I] Average on 10 runs - GPU latency: 40.5666 ms - Host latency: 41.0506 ms (enqueue 2.1832 ms)
[09/29/2023-08:14:50] [I] Average on 10 runs - GPU latency: 40.7761 ms - Host latency: 41.2634 ms (enqueue 2.16587 ms)
[09/29/2023-08:14:50] [I] 
[09/29/2023-08:14:50] [I] === Performance summary ===
[09/29/2023-08:14:50] [I] Throughput: 23.697 qps
[09/29/2023-08:14:50] [I] Latency: min = 38.28 ms, max = 47.5454 ms, mean = 42.0977 ms, median = 41.7066 ms, percentile(90%) = 44.9677 ms, percentile(95%) = 46.0984 ms, percentile(99%) = 47.5454 ms
[09/29/2023-08:14:50] [I] Enqueue Time: min = 0.611572 ms, max = 2.50146 ms, mean = 1.55336 ms, median = 1.96292 ms, percentile(90%) = 2.20874 ms, percentile(95%) = 2.44141 ms, percentile(99%) = 2.50146 ms
[09/29/2023-08:14:50] [I] H2D Latency: min = 0.403931 ms, max = 0.437256 ms, mean = 0.418273 ms, median = 0.418945 ms, percentile(90%) = 0.427734 ms, percentile(95%) = 0.434814 ms, percentile(99%) = 0.437256 ms
[09/29/2023-08:14:50] [I] GPU Compute Time: min = 37.807 ms, max = 47.0848 ms, mean = 41.6217 ms, median = 41.238 ms, percentile(90%) = 44.5071 ms, percentile(95%) = 45.6377 ms, percentile(99%) = 47.0848 ms
[09/29/2023-08:14:50] [W] * GPU compute time is unstable, with coefficient of variance = 4.82119%.
[09/29/2023-08:14:50] [I] D2H Latency: min = 0.048584 ms, max = 0.0698242 ms, mean = 0.0577683 ms, median = 0.0539551 ms, percentile(90%) = 0.0673828 ms, percentile(95%) = 0.0681152 ms, percentile(99%) = 0.0698242 ms
[09/29/2023-08:14:50] [I] Total Host Walltime: 3.12276 s
[09/29/2023-08:14:50] [I] Total GPU Compute Time: 3.08001 s
[09/29/2023-08:14:50] [I] Explanations of the performance metrics are printed in the verbose logs.
[09/29/2023-08:14:50] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[09/29/2023-08:14:50] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8502] # trtexec --loadEngine=model_b1_gpu0_fp16.engine

  1. Box parsing is one step of InferPostprocessor::postProcessHost; you can add logging in NvDsInferParseYolo to measure its time consumption (a sketch follows below).
  2. As the configuration file config_infer_primary_yoloV7.txt shows, the CUDA-accelerated version NvDsInferParseYoloCuda is disabled. Please make sure it is enabled.
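
A minimal sketch of such timing: a drop-in scoped timer to place at the top of NvDsInferParseYolo in the custom lib. Only this instrumentation is new; the parser itself is the existing code:

#include <chrono>
#include <cstdio>

// Prints elapsed milliseconds when the enclosing scope (the parse
// function) returns.
struct ScopedTimerMs {
  explicit ScopedTimerMs(const char* tag)
      : tag_(tag), t0_(std::chrono::steady_clock::now()) {}
  ~ScopedTimerMs() {
    double ms = std::chrono::duration<double, std::milli>(
        std::chrono::steady_clock::now() - t0_).count();
    fprintf(stderr, "%s: %.3f ms\n", tag_, ms);
  }
  const char* tag_;
  std::chrono::steady_clock::time_point t0_;
};

// Usage inside the existing parser:
// extern "C" bool NvDsInferParseYolo(...)
// {
//   ScopedTimerMs timer{"NvDsInferParseYolo"};
//   ... existing box-parsing logic ...
// }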

You can use fakesink when testing performance.

Yes, yolov7 is more complex. You can use “trtexec --loadEngine=saved.engine --fp16” to test (a fuller example follows).
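
For completeness, a sketch of building and then benchmarking an FP16 engine directly with trtexec; the ONNX and engine file names are illustrative:

# Build an FP16 engine from the exported ONNX (standard trtexec options)
trtexec --onnx=yolov7.onnx --fp16 --saveEngine=yolov7_fp16.engine
# Benchmark the saved engine
trtexec --loadEngine=yolov7_fp16.engine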

If using custom code, please refer to point 2 in this method.

  1. In the trtexec --loadEngine=saved.engine output, does Throughput: 81.1558 qps refer to the max FPS I could get using this engine? Could you elaborate on what qps means?

Here are the results for a 640x640 yoloV7 engine with different precisions:

  • fp32 - 24.1385 qps

  • fp16 - 48.4616 qps

  • int8 - 81.1158 qps

It all makes sense - lower precision gives the engine more throughput. However, when I run the engine in DeepStream, I do not reach these numbers; I assume this is due to the additional overhead of the DeepStream pipeline.

  1. Running yolov4 in DeepStream gave me around 140 FPS. When I tried to inspect the engine with trtexec, it gave me an error:
[10/03/2023-05:24:21] [I] TensorRT version: 8.5.2
[10/03/2023-05:24:21] [I] Engine loaded in 0.226273 sec.
[10/03/2023-05:24:22] [I] [TRT] Loaded engine size: 156 MiB
[10/03/2023-05:24:23] [E] Error[1]: [pluginV2Runner.cpp::load::300] Error Code 1: Serialization (Serialization assertion creator failed.Cannot deserialize plugin since corresponding IPluginCreator not found in Plugin Registry)
&&&& FAILED TensorRT.trtexec [TensorRT v8502] # trtexec --loadEngine=model_b1_gpu0_fp16.engine
[10/03/2023-05:24:23] [E] Error[4]: [runtime.cpp::deserializeCudaEngine::66] Error Code 4: Internal Error (Engine deserialization failed.)
[10/03/2023-05:24:23] [E] Engine deserialization failed
[10/03/2023-05:24:23] [E] Got invalid engine!

Perhaps you have an idea why this happens?

  1. I have two devices, a GeForce GTX 1650 and a Jetson AGX Orin 64GB, and running trtexec --loadEngine gives me a similar throughput of ~44 qps on both. How is that possible? Isn't the Orin device much more capable, based on these benchmarks?
    Here is the full output of trtexec on the Orin:
trtexec --loadEngine=model_b1_gpu0_fp16.engine
&&&& RUNNING TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=model_b1_gpu0_fp16.engine
[10/03/2023-05:31:05] [I] === Model Options ===
[10/03/2023-05:31:05] [I] Format: *
[10/03/2023-05:31:05] [I] Model: 
[10/03/2023-05:31:05] [I] Output:
[10/03/2023-05:31:05] [I] === Build Options ===
[10/03/2023-05:31:05] [I] Max batch: 1
[10/03/2023-05:31:05] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[10/03/2023-05:31:05] [I] minTiming: 1
[10/03/2023-05:31:05] [I] avgTiming: 8
[10/03/2023-05:31:05] [I] Precision: FP32
[10/03/2023-05:31:05] [I] LayerPrecisions: 
[10/03/2023-05:31:05] [I] Calibration: 
[10/03/2023-05:31:05] [I] Refit: Disabled
[10/03/2023-05:31:05] [I] Sparsity: Disabled
[10/03/2023-05:31:05] [I] Safe mode: Disabled
[10/03/2023-05:31:05] [I] DirectIO mode: Disabled
[10/03/2023-05:31:05] [I] Restricted mode: Disabled
[10/03/2023-05:31:05] [I] Build only: Disabled
[10/03/2023-05:31:05] [I] Save engine: 
[10/03/2023-05:31:05] [I] Load engine: model_b1_gpu0_fp16.engine
[10/03/2023-05:31:05] [I] Profiling verbosity: 0
[10/03/2023-05:31:05] [I] Tactic sources: Using default tactic sources
[10/03/2023-05:31:05] [I] timingCacheMode: local
[10/03/2023-05:31:05] [I] timingCacheFile: 
[10/03/2023-05:31:05] [I] Heuristic: Disabled
[10/03/2023-05:31:05] [I] Preview Features: Use default preview flags.
[10/03/2023-05:31:05] [I] Input(s)s format: fp32:CHW
[10/03/2023-05:31:05] [I] Output(s)s format: fp32:CHW
[10/03/2023-05:31:05] [I] Input build shapes: model
[10/03/2023-05:31:05] [I] Input calibration shapes: model
[10/03/2023-05:31:05] [I] === System Options ===
[10/03/2023-05:31:05] [I] Device: 0
[10/03/2023-05:31:05] [I] DLACore: 
[10/03/2023-05:31:05] [I] Plugins:
[10/03/2023-05:31:05] [I] === Inference Options ===
[10/03/2023-05:31:05] [I] Batch: 1
[10/03/2023-05:31:05] [I] Input inference shapes: model
[10/03/2023-05:31:05] [I] Iterations: 10
[10/03/2023-05:31:05] [I] Duration: 3s (+ 200ms warm up)
[10/03/2023-05:31:05] [I] Sleep time: 0ms
[10/03/2023-05:31:05] [I] Idle time: 0ms
[10/03/2023-05:31:05] [I] Streams: 1
[10/03/2023-05:31:05] [I] ExposeDMA: Disabled
[10/03/2023-05:31:05] [I] Data transfers: Enabled
[10/03/2023-05:31:05] [I] Spin-wait: Disabled
[10/03/2023-05:31:05] [I] Multithreading: Disabled
[10/03/2023-05:31:05] [I] CUDA Graph: Disabled
[10/03/2023-05:31:05] [I] Separate profiling: Disabled
[10/03/2023-05:31:05] [I] Time Deserialize: Disabled
[10/03/2023-05:31:05] [I] Time Refit: Disabled
[10/03/2023-05:31:05] [I] NVTX verbosity: 0
[10/03/2023-05:31:05] [I] Persistent Cache Ratio: 0
[10/03/2023-05:31:05] [I] Inputs:
[10/03/2023-05:31:05] [I] === Reporting Options ===
[10/03/2023-05:31:05] [I] Verbose: Disabled
[10/03/2023-05:31:05] [I] Averages: 10 inferences
[10/03/2023-05:31:05] [I] Percentiles: 90,95,99
[10/03/2023-05:31:05] [I] Dump refittable layers:Disabled
[10/03/2023-05:31:05] [I] Dump output: Disabled
[10/03/2023-05:31:05] [I] Profile: Disabled
[10/03/2023-05:31:05] [I] Export timing to JSON file: 
[10/03/2023-05:31:05] [I] Export output to JSON file: 
[10/03/2023-05:31:05] [I] Export profile to JSON file: 
[10/03/2023-05:31:05] [I] 
[10/03/2023-05:31:05] [I] === Device Information ===
[10/03/2023-05:31:05] [I] Selected Device: Orin
[10/03/2023-05:31:05] [I] Compute Capability: 8.7
[10/03/2023-05:31:05] [I] SMs: 8
[10/03/2023-05:31:05] [I] Compute Clock Rate: 1.3 GHz
[10/03/2023-05:31:05] [I] Device Global Memory: 62796 MiB
[10/03/2023-05:31:05] [I] Shared Memory per SM: 164 KiB
[10/03/2023-05:31:05] [I] Memory Bus Width: 128 bits (ECC disabled)
[10/03/2023-05:31:05] [I] Memory Clock Rate: 0.612 GHz
[10/03/2023-05:31:05] [I] 
[10/03/2023-05:31:05] [I] TensorRT version: 8.5.2
[10/03/2023-05:31:05] [I] Engine loaded in 0.0635924 sec.
[10/03/2023-05:31:05] [I] [TRT] Loaded engine size: 72 MiB
[10/03/2023-05:31:06] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +69, now: CPU 0, GPU 69 (MiB)
[10/03/2023-05:31:06] [I] Engine deserialized in 0.844545 sec.
[10/03/2023-05:31:06] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +58, now: CPU 0, GPU 127 (MiB)
[10/03/2023-05:31:06] [I] Setting persistentCacheLimit to 0 bytes.
[10/03/2023-05:31:06] [I] Using random values for input input
[10/03/2023-05:31:06] [I] Created input binding for input with dimensions 1x3x640x640
[10/03/2023-05:31:06] [I] Using random values for output boxes
[10/03/2023-05:31:06] [I] Created output binding for boxes with dimensions 1x25200x4
[10/03/2023-05:31:06] [I] Using random values for output scores
[10/03/2023-05:31:06] [I] Created output binding for scores with dimensions 1x25200x1
[10/03/2023-05:31:06] [I] Using random values for output classes
[10/03/2023-05:31:06] [I] Created output binding for classes with dimensions 1x25200x1
[10/03/2023-05:31:06] [I] Starting inference
[10/03/2023-05:31:09] [I] Warmup completed 6 queries over 200 ms
[10/03/2023-05:31:09] [I] Timing trace has 136 queries over 3.07206 s
[10/03/2023-05:31:09] [I] 
[10/03/2023-05:31:09] [I] === Trace details ===
[10/03/2023-05:31:09] [I] Trace averages of 10 runs:
[10/03/2023-05:31:09] [I] Average on 10 runs - GPU latency: 22.4587 ms - Host latency: 22.8501 ms (enqueue 1.72061 ms)
[10/03/2023-05:31:09] [I] Average on 10 runs - GPU latency: 22.3935 ms - Host latency: 22.7773 ms (enqueue 1.73514 ms)
[10/03/2023-05:31:09] [I] Average on 10 runs - GPU latency: 22.2502 ms - Host latency: 22.6302 ms (enqueue 1.70425 ms)
[10/03/2023-05:31:09] [I] Average on 10 runs - GPU latency: 22.5454 ms - Host latency: 22.9373 ms (enqueue 1.72961 ms)
[10/03/2023-05:31:09] [I] Average on 10 runs - GPU latency: 22.295 ms - Host latency: 22.6798 ms (enqueue 1.73606 ms)
[10/03/2023-05:31:09] [I] Average on 10 runs - GPU latency: 22.4483 ms - Host latency: 22.8367 ms (enqueue 1.65728 ms)
[10/03/2023-05:31:09] [I] Average on 10 runs - GPU latency: 22.4314 ms - Host latency: 22.8232 ms (enqueue 1.5458 ms)
[10/03/2023-05:31:09] [I] Average on 10 runs - GPU latency: 22.4336 ms - Host latency: 22.823 ms (enqueue 1.61167 ms)
[10/03/2023-05:31:09] [I] Average on 10 runs - GPU latency: 22.4293 ms - Host latency: 22.8149 ms (enqueue 1.77466 ms)
[10/03/2023-05:31:09] [I] Average on 10 runs - GPU latency: 22.4159 ms - Host latency: 22.8056 ms (enqueue 1.58232 ms)
[10/03/2023-05:31:09] [I] Average on 10 runs - GPU latency: 22.4242 ms - Host latency: 22.813 ms (enqueue 1.59656 ms)
[10/03/2023-05:31:09] [I] Average on 10 runs - GPU latency: 22.4356 ms - Host latency: 22.8239 ms (enqueue 1.76089 ms)
[10/03/2023-05:31:09] [I] Average on 10 runs - GPU latency: 22.4284 ms - Host latency: 22.8187 ms (enqueue 1.58228 ms)
[10/03/2023-05:31:09] [I] 
[10/03/2023-05:31:09] [I] === Performance summary ===
[10/03/2023-05:31:09] [I] Throughput: 44.27 qps
[10/03/2023-05:31:09] [I] Latency: min = 22.4785 ms, max = 23.4652 ms, mean = 22.8028 ms, median = 22.8085 ms, percentile(90%) = 22.8674 ms, percentile(95%) = 23.2997 ms, percentile(99%) = 23.3519 ms
[10/03/2023-05:31:09] [I] Enqueue Time: min = 1.46387 ms, max = 3.01385 ms, mean = 1.66769 ms, median = 1.60083 ms, percentile(90%) = 1.75244 ms, percentile(95%) = 2.78625 ms, percentile(99%) = 2.9436 ms
[10/03/2023-05:31:09] [I] H2D Latency: min = 0.30896 ms, max = 0.353516 ms, mean = 0.325468 ms, median = 0.324768 ms, percentile(90%) = 0.334839 ms, percentile(95%) = 0.337952 ms, percentile(99%) = 0.343323 ms
[10/03/2023-05:31:09] [I] GPU Compute Time: min = 22.1028 ms, max = 23.0645 ms, mean = 22.4151 ms, median = 22.4233 ms, percentile(90%) = 22.4722 ms, percentile(95%) = 22.9161 ms, percentile(99%) = 22.9575 ms
[10/03/2023-05:31:09] [I] D2H Latency: min = 0.0471191 ms, max = 0.0653381 ms, mean = 0.0622159 ms, median = 0.0622559 ms, percentile(90%) = 0.0634155 ms, percentile(95%) = 0.0635986 ms, percentile(99%) = 0.0648193 ms
[10/03/2023-05:31:09] [I] Total Host Walltime: 3.07206 s
[10/03/2023-05:31:09] [I] Total GPU Compute Time: 3.04846 s
[10/03/2023-05:31:09] [I] Explanations of the performance metrics are printed in the verbose logs.
[10/03/2023-05:31:09] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=model_b1_gpu0_fp16.engine

  1. Perhaps this is related to GPU clock boosting? Could you elaborate on that?
  2. I get this warning when running DeepStream on the Orin device. Perhaps it is responsible for such low performance?

I would really appreciate your further support on these questions, thanks!

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks

  1. qps means “queries per second”; please refer to the doc.
  2. Please use LD_PRELOAD to load the custom TensorRT plugin library before running trtexec (see the sketch after this list).
  3/4. Please make sure jetson_clocks is set; please refer to this doc.
  5. The first line should be related to setup. How did you get this log? Did you try reinstalling?
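
Two sketches for the points above. For the deserialization error, preload the library that registers the missing IPluginCreator; the library name here is illustrative and depends on how the yolov4 plugin was built. For the clocks, select max power mode and lock Jetson clocks before benchmarking:

# Preload the custom plugin lib so trtexec can deserialize the engine
LD_PRELOAD=/path/to/libyolo_plugin.so \
  /usr/src/tensorrt/bin/trtexec --loadEngine=model_b1_gpu0_fp16.engine

# On Jetson: max power mode, then lock clocks
sudo nvpmodel -m 0
sudo jetson_clocks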
