Any way to boost YOLO performance on Jetson Orin?

I have a Jetson Orin Nano 8GB board.

  • JetPack 5.1.4
  • OpenCV 4.9.0 (CUDA enabled)
$ yolo version
8.3.28

Run with model=yolo11n.engine, 1920x1080 @ 30 FPS.
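The run is essentially the following (a hypothetical reconstruction; the actual RTP source URL is not shown here):

from ultralytics import YOLO

# Hypothetical reconstruction of the run above; the real RTP/RTSP source URL
# is omitted. stream=True yields results frame by frame.
model = YOLO("yolo11n.engine")
for result in model.predict(source="rtsp://<camera-url>/stream", stream=True):
    pass  # draw/publish the detections here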

I'm only getting 10~12 FPS, but GPU and CPU utilization are NOT that high.

Is there any way to boost performance?

Hi,

Have you maximized device performance?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Could you monitor GPU usage with tegrastats at the same time?

$ sudo tegrastats

Thanks

@AastaLLL

I got the same result, ~10 FPS, while the live RTP stream runs at 30 FPS.

Please check the jtop/tegrastats results.

PS: YOLO was started in the middle of the video, at about the 40 s mark.

EDIT: Checked the engine file, but it failed

  1. I got this file from YOLO11 🚀 NEW - Ultralytics YOLO Docs.
  2. And followed NVIDIA Jetson - Ultralytics YOLO Docs to convert the .pt file to a TensorRT model: yolo export model=yolo11n.pt format=engine (the equivalent Python export is sketched below).
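For reference, the equivalent Python export looks roughly like this (a sketch; half=True is an optional FP16 knob that usually helps on Jetson, not part of the command above):

from ultralytics import YOLO

# Sketch of the TensorRT export. half=True (FP16) is an optional extra that
# usually helps on Jetson; it was not in the original command.
model = YOLO("yolo11n.pt")
model.export(format="engine", half=True, device=0)  # writes yolo11n.engine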

Now it fails:

$ /usr/src/tensorrt/bin/trtexec ./model/yolo11n.engine
[11/19/2024-16:43:32] [E] Model missing or format not recognized

Any ideas?

It seems quite similar to Slow FPS on Orin Nano 8 GB - YoloV8

EDIT: trtexec now runs when the engine is passed via --loadEngine, but can you help check whether this output is OK?

$ /usr/src/tensorrt/bin/trtexec --loadEngine=model/yolo11n.engine
&&&& RUNNING TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=model/yolo11n.engine
[11/19/2024-17:03:33] [I] === Model Options ===
[11/19/2024-17:03:33] [I] Format: *
[11/19/2024-17:03:33] [I] Model:
[11/19/2024-17:03:33] [I] Output:
[11/19/2024-17:03:33] [I] === Build Options ===
[11/19/2024-17:03:33] [I] Max batch: 1
[11/19/2024-17:03:33] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[11/19/2024-17:03:33] [I] minTiming: 1
[11/19/2024-17:03:33] [I] avgTiming: 8
[11/19/2024-17:03:33] [I] Precision: FP32
[11/19/2024-17:03:33] [I] LayerPrecisions:
[11/19/2024-17:03:33] [I] Calibration:
[11/19/2024-17:03:33] [I] Refit: Disabled
[11/19/2024-17:03:33] [I] Sparsity: Disabled
[11/19/2024-17:03:33] [I] Safe mode: Disabled
[11/19/2024-17:03:33] [I] DirectIO mode: Disabled
[11/19/2024-17:03:33] [I] Restricted mode: Disabled
[11/19/2024-17:03:33] [I] Build only: Disabled
[11/19/2024-17:03:33] [I] Save engine:
[11/19/2024-17:03:33] [I] Load engine: model/yolo11n.engine
[11/19/2024-17:03:33] [I] Profiling verbosity: 0
[11/19/2024-17:03:33] [I] Tactic sources: Using default tactic sources
[11/19/2024-17:03:33] [I] timingCacheMode: local
[11/19/2024-17:03:33] [I] timingCacheFile:
[11/19/2024-17:03:33] [I] Heuristic: Disabled
[11/19/2024-17:03:33] [I] Preview Features: Use default preview flags.
[11/19/2024-17:03:33] [I] Input(s)s format: fp32:CHW
[11/19/2024-17:03:33] [I] Output(s)s format: fp32:CHW
[11/19/2024-17:03:33] [I] Input build shapes: model
[11/19/2024-17:03:33] [I] Input calibration shapes: model
[11/19/2024-17:03:33] [I] === System Options ===
[11/19/2024-17:03:33] [I] Device: 0
[11/19/2024-17:03:33] [I] DLACore:
[11/19/2024-17:03:33] [I] Plugins:
[11/19/2024-17:03:33] [I] === Inference Options ===
[11/19/2024-17:03:33] [I] Batch: 1
[11/19/2024-17:03:33] [I] Input inference shapes: model
[11/19/2024-17:03:33] [I] Iterations: 10
[11/19/2024-17:03:33] [I] Duration: 3s (+ 200ms warm up)
[11/19/2024-17:03:33] [I] Sleep time: 0ms
[11/19/2024-17:03:33] [I] Idle time: 0ms
[11/19/2024-17:03:33] [I] Streams: 1
[11/19/2024-17:03:33] [I] ExposeDMA: Disabled
[11/19/2024-17:03:33] [I] Data transfers: Enabled
[11/19/2024-17:03:33] [I] Spin-wait: Disabled
[11/19/2024-17:03:33] [I] Multithreading: Disabled
[11/19/2024-17:03:33] [I] CUDA Graph: Disabled
[11/19/2024-17:03:33] [I] Separate profiling: Disabled
[11/19/2024-17:03:33] [I] Time Deserialize: Disabled
[11/19/2024-17:03:33] [I] Time Refit: Disabled
[11/19/2024-17:03:33] [I] NVTX verbosity: 0
[11/19/2024-17:03:33] [I] Persistent Cache Ratio: 0
[11/19/2024-17:03:33] [I] Inputs:
[11/19/2024-17:03:33] [I] === Reporting Options ===
[11/19/2024-17:03:33] [I] Verbose: Disabled
[11/19/2024-17:03:33] [I] Averages: 10 inferences
[11/19/2024-17:03:33] [I] Percentiles: 90,95,99
[11/19/2024-17:03:33] [I] Dump refittable layers:Disabled
[11/19/2024-17:03:33] [I] Dump output: Disabled
[11/19/2024-17:03:33] [I] Profile: Disabled
[11/19/2024-17:03:33] [I] Export timing to JSON file:
[11/19/2024-17:03:33] [I] Export output to JSON file:
[11/19/2024-17:03:33] [I] Export profile to JSON file:
[11/19/2024-17:03:33] [I]
[11/19/2024-17:03:33] [I] === Device Information ===
[11/19/2024-17:03:33] [I] Selected Device: Orin
[11/19/2024-17:03:33] [I] Compute Capability: 8.7
[11/19/2024-17:03:33] [I] SMs: 8
[11/19/2024-17:03:33] [I] Compute Clock Rate: 0.624 GHz
[11/19/2024-17:03:33] [I] Device Global Memory: 7451 MiB
[11/19/2024-17:03:33] [I] Shared Memory per SM: 164 KiB
[11/19/2024-17:03:33] [I] Memory Bus Width: 128 bits (ECC disabled)
[11/19/2024-17:03:33] [I] Memory Clock Rate: 0.624 GHz
[11/19/2024-17:03:33] [I]
[11/19/2024-17:03:33] [I] TensorRT version: 8.5.2
[11/19/2024-17:03:33] [I] Engine loaded in 0.0139191 sec.
[11/19/2024-17:03:34] [I] [TRT] Loaded engine size: 11 MiB
[11/19/2024-17:03:34] [E] Error[1]: [stdArchiveReader.cpp::StdArchiveReader::32] Error Code 1: Serialization (Serialization assertion magicTagRead == kMAGIC_TAG failed.Magic tag does not match)
[11/19/2024-17:03:34] [E] Error[4]: [runtime.cpp::deserializeCudaEngine::65] Error Code 4: Internal Error (Engine deserialization failed.)
[11/19/2024-17:03:34] [E] Engine deserialization failed
[11/19/2024-17:03:34] [E] Got invalid engine!
[11/19/2024-17:03:34] [E] Inference set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=model/yolo11n.engine

I got the engine format fixed, but performance didn't improve much: from 10 FPS only to 13 FPS. There is still an issue.
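For anyone hitting the same magic-tag error: my understanding (an assumption, based on how the Ultralytics exporter writes .engine files) is that the file starts with a 4-byte little-endian length and a JSON metadata blob before the raw serialized engine, which plain trtexec cannot parse. A minimal sketch to strip the header:

import sys

# Minimal sketch, assuming the Ultralytics .engine layout: a 4-byte
# little-endian length, then a JSON metadata blob, then the raw TensorRT engine.
src, dst = sys.argv[1], sys.argv[2]
with open(src, "rb") as f:
    meta_len = int.from_bytes(f.read(4), byteorder="little", signed=True)
    f.read(meta_len)          # skip the JSON metadata
    engine_bytes = f.read()   # the remainder is the plain engine
with open(dst, "wb") as f:
    f.write(engine_bytes)

Run it as, e.g., python3 strip_meta.py yolo11n.engine yolo11n_plain.engine (strip_meta.py is a hypothetical filename).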

EDIT: And it can’t predict!

$ /usr/src/tensorrt/bin/trtexec --loadEngine=./model/yolo11n.engine
&&&& RUNNING TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=./model/yolo11n.engine
[11/19/2024-17:21:38] [I] === Model Options ===
[11/19/2024-17:21:38] [I] Format: *
[11/19/2024-17:21:38] [I] Model:
[11/19/2024-17:21:38] [I] Output:
[11/19/2024-17:21:38] [I] === Build Options ===
[11/19/2024-17:21:38] [I] Max batch: 1
[11/19/2024-17:21:38] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[11/19/2024-17:21:38] [I] minTiming: 1
[11/19/2024-17:21:38] [I] avgTiming: 8
[11/19/2024-17:21:38] [I] Precision: FP32
[11/19/2024-17:21:38] [I] LayerPrecisions:
[11/19/2024-17:21:38] [I] Calibration:
[11/19/2024-17:21:38] [I] Refit: Disabled
[11/19/2024-17:21:38] [I] Sparsity: Disabled
[11/19/2024-17:21:38] [I] Safe mode: Disabled
[11/19/2024-17:21:38] [I] DirectIO mode: Disabled
[11/19/2024-17:21:38] [I] Restricted mode: Disabled
[11/19/2024-17:21:38] [I] Build only: Disabled
[11/19/2024-17:21:38] [I] Save engine:
[11/19/2024-17:21:38] [I] Load engine: ./model/yolo11n.engine
[11/19/2024-17:21:38] [I] Profiling verbosity: 0
[11/19/2024-17:21:38] [I] Tactic sources: Using default tactic sources
[11/19/2024-17:21:38] [I] timingCacheMode: local
[11/19/2024-17:21:38] [I] timingCacheFile:
[11/19/2024-17:21:38] [I] Heuristic: Disabled
[11/19/2024-17:21:38] [I] Preview Features: Use default preview flags.
[11/19/2024-17:21:38] [I] Input(s)s format: fp32:CHW
[11/19/2024-17:21:38] [I] Output(s)s format: fp32:CHW
[11/19/2024-17:21:38] [I] Input build shapes: model
[11/19/2024-17:21:38] [I] Input calibration shapes: model
[11/19/2024-17:21:38] [I] === System Options ===
[11/19/2024-17:21:38] [I] Device: 0
[11/19/2024-17:21:38] [I] DLACore:
[11/19/2024-17:21:38] [I] Plugins:
[11/19/2024-17:21:38] [I] === Inference Options ===
[11/19/2024-17:21:38] [I] Batch: 1
[11/19/2024-17:21:38] [I] Input inference shapes: model
[11/19/2024-17:21:38] [I] Iterations: 10
[11/19/2024-17:21:38] [I] Duration: 3s (+ 200ms warm up)
[11/19/2024-17:21:38] [I] Sleep time: 0ms
[11/19/2024-17:21:38] [I] Idle time: 0ms
[11/19/2024-17:21:38] [I] Streams: 1
[11/19/2024-17:21:38] [I] ExposeDMA: Disabled
[11/19/2024-17:21:38] [I] Data transfers: Enabled
[11/19/2024-17:21:38] [I] Spin-wait: Disabled
[11/19/2024-17:21:38] [I] Multithreading: Disabled
[11/19/2024-17:21:38] [I] CUDA Graph: Disabled
[11/19/2024-17:21:38] [I] Separate profiling: Disabled
[11/19/2024-17:21:38] [I] Time Deserialize: Disabled
[11/19/2024-17:21:38] [I] Time Refit: Disabled
[11/19/2024-17:21:38] [I] NVTX verbosity: 0
[11/19/2024-17:21:38] [I] Persistent Cache Ratio: 0
[11/19/2024-17:21:38] [I] Inputs:
[11/19/2024-17:21:38] [I] === Reporting Options ===
[11/19/2024-17:21:38] [I] Verbose: Disabled
[11/19/2024-17:21:38] [I] Averages: 10 inferences
[11/19/2024-17:21:38] [I] Percentiles: 90,95,99
[11/19/2024-17:21:38] [I] Dump refittable layers:Disabled
[11/19/2024-17:21:38] [I] Dump output: Disabled
[11/19/2024-17:21:38] [I] Profile: Disabled
[11/19/2024-17:21:38] [I] Export timing to JSON file:
[11/19/2024-17:21:38] [I] Export output to JSON file:
[11/19/2024-17:21:38] [I] Export profile to JSON file:
[11/19/2024-17:21:38] [I]
[11/19/2024-17:21:38] [I] === Device Information ===
[11/19/2024-17:21:38] [I] Selected Device: Orin
[11/19/2024-17:21:38] [I] Compute Capability: 8.7
[11/19/2024-17:21:38] [I] SMs: 8
[11/19/2024-17:21:38] [I] Compute Clock Rate: 0.624 GHz
[11/19/2024-17:21:38] [I] Device Global Memory: 7451 MiB
[11/19/2024-17:21:38] [I] Shared Memory per SM: 164 KiB
[11/19/2024-17:21:38] [I] Memory Bus Width: 128 bits (ECC disabled)
[11/19/2024-17:21:38] [I] Memory Clock Rate: 0.624 GHz
[11/19/2024-17:21:38] [I]
[11/19/2024-17:21:38] [I] TensorRT version: 8.5.2
[11/19/2024-17:21:38] [I] Engine loaded in 0.0135807 sec.
[11/19/2024-17:21:38] [I] [TRT] Loaded engine size: 11 MiB
[11/19/2024-17:21:40] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +616, GPU +586, now: CPU 907, GPU 3559 (MiB)
[11/19/2024-17:21:40] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +10, now: CPU 0, GPU 10 (MiB)
[11/19/2024-17:21:40] [I] Engine deserialized in 1.89538 sec.
[11/19/2024-17:21:40] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 908, GPU 3559 (MiB)
[11/19/2024-17:21:40] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +18, now: CPU 0, GPU 28 (MiB)
[11/19/2024-17:21:40] [I] Setting persistentCacheLimit to 0 bytes.
[11/19/2024-17:21:40] [I] Using random values for input images
[11/19/2024-17:21:40] [I] Created input binding for images with dimensions 1x3x640x640
[11/19/2024-17:21:40] [I] Using random values for output output0
[11/19/2024-17:21:40] [I] Created output binding for output0 with dimensions 1x84x8400
[11/19/2024-17:21:40] [I] Starting inference
[11/19/2024-17:21:43] [I] Warmup completed 1 queries over 200 ms
[11/19/2024-17:21:43] [I] Timing trace has 168 queries over 2.0829 s
[11/19/2024-17:21:43] [I]
[11/19/2024-17:21:43] [I] === Trace details ===
[11/19/2024-17:21:43] [I] Trace averages of 10 runs:
[11/19/2024-17:21:43] [I] Average on 10 runs - GPU latency: 12.377 ms - Host latency: 13.1714 ms (enqueue 2.26556 ms)
[11/19/2024-17:21:43] [I] Average on 10 runs - GPU latency: 12.3984 ms - Host latency: 13.23 ms (enqueue 2.01779 ms)
[11/19/2024-17:21:43] [I] Average on 10 runs - GPU latency: 12.4013 ms - Host latency: 13.2285 ms (enqueue 1.99799 ms)
[11/19/2024-17:21:43] [I] Average on 10 runs - GPU latency: 12.3937 ms - Host latency: 13.2252 ms (enqueue 1.96305 ms)
[11/19/2024-17:21:43] [I] Average on 10 runs - GPU latency: 12.3914 ms - Host latency: 13.2226 ms (enqueue 1.96279 ms)
[11/19/2024-17:21:43] [I] Average on 10 runs - GPU latency: 12.389 ms - Host latency: 13.2236 ms (enqueue 1.96001 ms)
[11/19/2024-17:21:43] [I] Average on 10 runs - GPU latency: 12.3958 ms - Host latency: 13.2285 ms (enqueue 1.9571 ms)
[11/19/2024-17:21:43] [I] Average on 10 runs - GPU latency: 12.3916 ms - Host latency: 13.2211 ms (enqueue 1.96346 ms)
[11/19/2024-17:21:43] [I] Average on 10 runs - GPU latency: 12.3975 ms - Host latency: 13.2322 ms (enqueue 1.95918 ms)
[11/19/2024-17:21:43] [I] Average on 10 runs - GPU latency: 12.3922 ms - Host latency: 13.2249 ms (enqueue 1.96177 ms)
[11/19/2024-17:21:43] [I] Average on 10 runs - GPU latency: 12.3974 ms - Host latency: 13.2266 ms (enqueue 2.03279 ms)
[11/19/2024-17:21:43] [I] Average on 10 runs - GPU latency: 12.3936 ms - Host latency: 13.2238 ms (enqueue 1.9748 ms)
[11/19/2024-17:21:43] [I] Average on 10 runs - GPU latency: 12.3925 ms - Host latency: 13.2229 ms (enqueue 1.9603 ms)
[11/19/2024-17:21:43] [I] Average on 10 runs - GPU latency: 12.3931 ms - Host latency: 13.2239 ms (enqueue 1.96084 ms)
[11/19/2024-17:21:43] [I] Average on 10 runs - GPU latency: 12.3948 ms - Host latency: 13.2253 ms (enqueue 1.96047 ms)
[11/19/2024-17:21:43] [I] Average on 10 runs - GPU latency: 12.3944 ms - Host latency: 13.2263 ms (enqueue 1.96707 ms)
[11/19/2024-17:21:43] [I]
[11/19/2024-17:21:43] [I] === Performance summary ===
[11/19/2024-17:21:43] [I] Throughput: 80.6568 qps
[11/19/2024-17:21:43] [I] Latency: min = 12.9473 ms, max = 13.2667 ms, mean = 13.2213 ms, median = 13.2251 ms, percentile(90%) = 13.2493 ms, percentile(95%) = 13.2561 ms, percentile(99%) = 13.2632 ms
[11/19/2024-17:21:43] [I] Enqueue Time: min = 1.92603 ms, max = 3.52295 ms, mean = 1.99036 ms, median = 1.97241 ms, percentile(90%) = 2.0332 ms, percentile(95%) = 2.06165 ms, percentile(99%) = 2.61487 ms
[11/19/2024-17:21:43] [I] H2D Latency: min = 0.358154 ms, max = 0.586304 ms, mean = 0.570459 ms, median = 0.573242 ms, percentile(90%) = 0.580566 ms, percentile(95%) = 0.582764 ms, percentile(99%) = 0.585815 ms
[11/19/2024-17:21:43] [I] GPU Compute Time: min = 12.3115 ms, max = 12.4321 ms, mean = 12.3928 ms, median = 12.3928 ms, percentile(90%) = 12.4167 ms, percentile(95%) = 12.4238 ms, percentile(99%) = 12.4309 ms
[11/19/2024-17:21:43] [I] D2H Latency: min = 0.166748 ms, max = 0.264282 ms, mean = 0.258093 ms, median = 0.258606 ms, percentile(90%) = 0.261475 ms, percentile(95%) = 0.262207 ms, percentile(99%) = 0.26355 ms
[11/19/2024-17:21:43] [I] Total Host Walltime: 2.0829 s
[11/19/2024-17:21:43] [I] Total GPU Compute Time: 2.08198 s
[11/19/2024-17:21:43] [I] Explanations of the performance metrics are printed in the verbose logs.
[11/19/2024-17:21:43] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=./model/yolo11n.engine

The bare engine benchmarks at ~80 qps above, yet the app only reaches ~13 FPS, so the bottleneck seems to be outside TensorRT. Please help to check the system settings:

System Environment:
  Python 3.8.10
  GStreamer: YES (1.16.3)
  NVIDIA CUDA: YES (ver 11.4, CUFFT CUBLAS FAST_MATH)
  OpenCV version: 4.9.0, CUDA support: True
  Torch version: 2.1.0a0+41361538.nv23.06
  Torchvision version: 0.16.1+fdea156


Software part of jetson-stats 4.2.12 - (c) 2024, Raffaello Bonghi
Model: NVIDIA Orin Nano Developer Kit - Jetpack 5.1.4 [L4T 35.6.0]
NV Power Mode[0]: 15W
Serial Number: [XXX Show with: jetson_release -s XXX]
Hardware:
 - P-Number: p3767-0005
 - Module: NVIDIA Jetson Orin Nano (Developer kit)
Platform:
 - Distribution: Ubuntu 20.04 focal
 - Release: 5.10.216-tegra
jtop:
 - Version: 4.2.12
 - Service: Active
Libraries:
 - CUDA: 11.4.315
 - cuDNN: 8.6.0.166
 - TensorRT: 8.5.2.2
 - VPI: 2.4.8
 - OpenCV: 4.9.0 - with CUDA: YES

Hi,

In your video, it looks like the GPU is not fully occupied, so there should be room for further optimization.
But this is an Ultralytics app; you will need to check with them to see whether they can further improve their implementation.

Thanks.

Is the Jetson Orin Nano 8GB equipped with DLA hardware?

How should the device parameter be configured, according to Export - Ultralytics YOLO Docs?
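From the Export docs, my untested guess is something like the following (only meaningful if the module has DLA cores at all):

from ultralytics import YOLO

# Untested guess: target DLA core 0 during TensorRT export; only applicable
# on modules that actually have DLA hardware. DLA export needs reduced precision.
model = YOLO("yolo11n.pt")
model.export(format="engine", device="dla:0", half=True)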

@AastaLLL

Currently, we have figured things out with the YOLO11n demo.

We got:

  • stable ~19 FPS @ 1920x1080, imgsz=320
  • stable ~21 FPS @ 1920x1080, imgsz=320, 16 selected classes (filtered as sketched below)
  • almost ~25 FPS @ 1280x960, imgsz=320, 16 selected classes
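The class filtering is just the classes argument to predict; the exact 16 IDs we used are not listed above, so these are placeholders:

from ultralytics import YOLO

# Placeholder class IDs; the actual 16 selected classes are not listed above.
model = YOLO("yolo11n.engine")
results = model.predict("test.mp4", imgsz=320, classes=[0, 1, 2, 3, 5, 7])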

Is that all we can get out of the Jetson Orin Nano 8GB?

Hi,

The document is written by Ultralytics.
Unfortunately, you will need to reach out to them to learn the exact settings behind their profiling data.

Thanks.

OK, thanks.

Is there any NVIDIA demo of small-object detection with NVIDIA hardware speedup?

Hi,

Have you checked our DeepStream SDK?

Thanks.

Yeah, still working on other stuff. Right now it's around 27 FPS (close to 30 FPS) with the RTP source running at 1920x1080@60FPS, YOLO11n, imgsz=320.

I'll try it, but I'm not sure how much we'll gain. Do you have any data or figures on this?

Hi,

Another common technique is to use a larger batch size so that several frames can be inferred together.
Could you check whether there is a flag in the YOLO app to change the batch size?

Thanks.

Yes, I have increased batch=8 (roughly as sketched below), but I only get 27 FPS while the RTP source is running at 60 FPS 1080p.
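Roughly like this (a sketch: the engine has to be rebuilt for the larger batch, then predict() is fed a list of frames):

import numpy as np
from ultralytics import YOLO

# Sketch: rebuild the engine for batch 8, then feed 8 frames per predict() call.
YOLO("yolo11n.pt").export(format="engine", batch=8, imgsz=320)

model = YOLO("yolo11n.engine")
frames = [np.zeros((1080, 1920, 3), dtype=np.uint8)] * 8  # stand-in frames
results = model.predict(frames, imgsz=320, verbose=False)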

But we are trying another (open-source) approach; detailed progress is here: jetson-fpv/doc/YOLO.md at main · SnapDragonfly/jetson-fpv · GitHub

The YOLO-related code is simply: jetson-fpv/utils/yolo.py at main · SnapDragonfly/jetson-fpv · GitHub

Please let me know; any idea or suggestion is much appreciated.

Hi,

Maybe you can add some profiling code to the while loop to see the elapsed time of each function.

There are some format conversions in the implementation.
Getting the elapsed time of each function will give you a rough idea of where to improve.
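For example, a minimal timer decorator along these lines (just a sketch):

import time
from functools import wraps

def timed(fn):
    # Print the wall-clock time of each call; wrap the capture, convert,
    # predict, and display functions inside the loop with this.
    @wraps(fn)
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()
        out = fn(*args, **kwargs)
        print(f"{fn.__name__}() execution time: {time.perf_counter() - t0:.6f} seconds")
        return out
    return wrapper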

Thanks.

Yes, as we suspected, most of the time is spent in the predict function, i.e. model.predict.

capture_image() execution time: 0.000179 seconds
cudaToNumpy() execution time: 0.000015 seconds
predict_frame() execution time: 0.023093 seconds
imshow() execution time: 0.001939 seconds
output() execution time: 0.001069 seconds

Note: these timings match the real FPS (predict alone is ~23 ms, i.e. a ~43 FPS ceiling even before display and output).

Currently I'm moving forward and going to try DeepStream.

Thanks for all your effort and help!

PS:
Request python deepstream demo based on RTP video feed

I just tried GitHub - marcoslucianops/DeepStream-Yolo: NVIDIA DeepStream SDK 7.1 / 7.0 / 6.4 / 6.3 / 6.2 / 6.1.1 / 6.1 / 6.0.1 / 6.0 / 5.1 implementation for YOLO models

It only gets ~15 FPS with the default YOLOv4 settings.
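The log below shows the default build is an FP32 YOLOv4 engine at 608x608 (model_b1_gpu0_fp32.engine), so one tweak I have not tried yet would be switching nvinfer to FP16 in config_infer_primary.txt:

# config_infer_primary.txt (excerpt) - untried tweak
# network-mode: 0=FP32, 1=INT8, 2=FP16
network-mode=2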

@AastaLLL Any idea?

Deserialize yoloLayer plugin: /home/daniel/Work/DeepStream-Yolo/yolov4.weights
0:00:05.060740580 17651 0xaaaacfae3360 INFO                 nvinfer gstnvinfer.cpp:682:gst_nvinfer_logger:<primary_gie> NvDsInferContext[UID 1]: Info from NvDsInferContextImpl::deserializeEngineAndBackend() <nvdsinfer_context_impl.cpp:1988> [UID = 1]: deserialized trt engine from :/home/daniel/Work/DeepStream-Yolo/model_b1_gpu0_fp32.engine
WARNING: [TRT]: The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
INFO: [Implicit Engine Info]: layers num: 2
0   INPUT  kFLOAT input           3x608x608
1   OUTPUT kFLOAT output          22743x6

0:00:05.269902153 17651 0xaaaacfae3360 INFO                 nvinfer gstnvinfer.cpp:682:gst_nvinfer_logger:<primary_gie> NvDsInferContext[UID 1]: Info from NvDsInferContextImpl::generateBackendContext() <nvdsinfer_context_impl.cpp:2091> [UID = 1]: Use deserialized engine model: /home/daniel/Work/DeepStream-Yolo/model_b1_gpu0_fp32.engine
0:00:05.300286056 17651 0xaaaacfae3360 INFO                 nvinfer gstnvinfer_impl.cpp:328:notifyLoadModelStatus:<primary_gie> [UID 1]: Load new model:/home/daniel/Work/DeepStream-Yolo/config_infer_primary.txt sucessfully

Runtime commands:
        h: Print this help
        q: Quit

        p: Pause
        r: Resume

NOTE: To expand a source in the 2D tiled display and view object details, left-click on the source.
      To go back to the tiled display, right-click anywhere on the window.


**PERF:  FPS 0 (Avg)
**PERF:  0.00 (0.00)
** INFO: <bus_callback:239>: Pipeline ready

Opening in BLOCKING MODE
NvMMLiteOpen : Block : BlockType = 261
NvMMLiteBlockCreate : Block : BlockType = 261
** INFO: <bus_callback:225>: Pipeline running

**PERF:  13.53 (13.36)
**PERF:  13.62 (13.48)
**PERF:  13.60 (13.52)