Hello, folks!
I’m experiencing performance issues with YOLOv8 on DeepStream 6.2. I’m using the default yolov8s.pt file and the generated cfg, wts, and labels.txt files (source: https://github.com/ultralytics/assets/releases/download/v0.0.0/yolov8s.pt ), and performing inference on the sample_1080p_h264.mp4 video (path: /opt/nvidia/deepstream/deepstream/samples/streams/). The reported performance is around 25 FPS on a 1920x1080 monitor running at 60 Hz. How can I increase this frame rate?
System Specifications:
Jetson Xavier NX
Volta GPU 384-core NVIDIA with 48 Tensor Cores
Ubuntu 20.04 LTS
CUDA 11.4.315
DeepStream 6.2
JetPack 5.1
PyTorch 1.12.0
Torchvision 0.13.0
TensorRT 8.5.2.2
Output:
**PERF: FPS 0 (Avg)
**PERF: 0.00 (0.00)
** INFO: <bus_callback:239>: Pipeline ready
Opening in BLOCKING MODE
NvMMLiteOpen : Block : BlockType = 261
NVMEDIA: Reading vendor.tegra.display-size : status: 6
NvMMLiteBlockCreate : Block : BlockType = 261
** INFO: <bus_callback:225>: Pipeline running
**PERF: 34.75 (31.86)
**PERF: 25.09 (27.03)
**PERF: 24.75 (26.27)
**PERF: 25.57 (24.91)
nvstreammux: Successfully handled EOS for source_id=0
**PERF: 25.53 (25.97)
**PERF: 25.68 (25.57)
**PERF: 25.52 (25.73)
**PERF: 25.61 (25.76)
**PERF: 25.63 (25.55)
**PERF: 25.59 (25.47)
**PERF: 25.52 (25.41)
** INFO: <bus_callback:262>: Received EOS. Exiting …
File deepstream_app_config.txt:
[application]
enable-perf-measurement=1
perf-measurement-interval-sec=5
[tiled-display]
enable=1
rows=1
columns=1
width=1280
height=720
gpu-id=0
nvbuf-memory-type=0
[source0]
enable=1
type=3
uri=file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_1080p_h264.mp4
num-sources=1
num-extra-surfaces=24
gpu-id=0
cudadec-memtype=0
[sink0]
enable=1
type=2
sync=0
gpu-id=0
nvbuf-memory-type=0
[osd]
enable=1
gpu-id=0
border-width=5
text-size=15
text-color=1;1;1;1;
text-bg-color=0.3;0.3;0.3;1
font=Serif
show-clock=0
clock-x-offset=800
clock-y-offset=820
clock-text-size=12
clock-color=1;0;0;0
nvbuf-memory-type=0
[streammux]
gpu-id=0
live-source=0
buffer-pool-size=1000
batch-size=1000
batched-push-timeout=100000
width=1280
height=720
enable-padding=0
nvbuf-memory-type=0
[primary-gie]
enable=1
gpu-id=0
batch-size=1
gie-unique-id=1
nvbuf-memory-type=0
config-file=config_infer_primary_yoloV8.txt
[tests]
file-loop=0
Thank you!
yingliu — September 18, 2023, 1:31pm
Please also share your PGIE config file, thanks.
Okay, here is the PGIE config file:
File config_infer_primary_yoloV8.txt:
[property]
gpu-id=0
net-scale-factor=0.0039215697906911373
model-color-format=0
custom-network-config=yolov8s.cfg
model-file=yolov8s.wts
model-engine-file=model_b1_gpu0_fp32.engine
#int8-calib-file=calib.table
labelfile-path=labels.txt
batch-size=10 #=1 default
network-mode=0
num-detected-classes=80
interval=0
gie-unique-id=1
process-mode=2
network-type=0
cluster-mode=2
maintain-aspect-ratio=1
symmetric-padding=1
parse-bbox-func-name=NvDsInferParseYolo
custom-lib-path=nvdsinfer_custom_impl_Yolo/libnvdsinfer_custom_impl_Yolo.so
engine-create-func-name=NvDsInferYoloCudaEngineGet
[class-attrs-all]
nms-iou-threshold=0.45
pre-cluster-threshold=0.25
topk=300
Why did you set the nvstreammux batch size to 1000? See: Frequently Asked Questions — DeepStream 6.3 Release documentation
batched-push-timeout should be 1/framerate (the property is specified in microseconds).
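Putting the two suggestions above together, the [streammux] group for a single ~30 FPS file source could look like this (a sketch; the pool size of 4 is an assumption, and 33333 µs corresponds to 1/30 s):

```
[streammux]
gpu-id=0
live-source=0
# batch-size should match the number of sources (1 here), not 1000
batch-size=1
# a small buffer pool is enough for a single source
buffer-pool-size=4
# 1/framerate in microseconds (~33 ms for a 30 FPS source)
batched-push-timeout=33333
width=1280
height=720
enable-padding=0
nvbuf-memory-type=0
```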
Please measure the model performance of the “model_b1_gpu0_fp32.engine” by the “trtexec” tool.
Hi, @Fiona.Chen . I’m sorry for the delay. I tried to measure the performance using the command:
/usr/src/tensorrt/bin/trtexec --loadEngine=model_b1_gpu0_fp32.engine
but it produced the following errors:
[09/25/2023-09:29:01] [E] Error[1]:[pluginV2Runner.cpp::load::299] Error Code 1: Serialization (Serialization assertion creator failed.Cannot deserialize plugin since corresponding IPluginCreator not found in Plugin Registry)
[09/25/2023-09:29:01] [E] Error[4]: [runtime.cpp::deserializeCudaEngine::65] Error Code 4: Internal Error (Engine deserialization failed.)
[09/25/2023-09:29:01] [E] Engine deserialization failed
[09/25/2023-09:29:01] [E] Got invalid engine!
[09/25/2023-09:29:01] [E] Inference set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=model_b1_gpu0_fp32.engine
How should I proceed?
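The “IPluginCreator not found” error above usually means trtexec cannot deserialize the custom YOLO layer compiled into the engine. One way around it (a sketch, assuming the plugin library path from the PGIE config) is to point trtexec at the custom library with --plugins:

```
/usr/src/tensorrt/bin/trtexec \
    --loadEngine=model_b1_gpu0_fp32.engine \
    --plugins=nvdsinfer_custom_impl_Yolo/libnvdsinfer_custom_impl_Yolo.so
```

With the plugin creator registered, deserialization should succeed and trtexec can report the latency/throughput summary.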
How and where did you generate the engine file?
I’ve measured the model engine performance by trtexec in my Xavier NX board with max power.
[09/26/2023-14:40:55] [I] === Performance summary ===
[09/26/2023-14:40:55] [I] Throughput: 21.5936 qps
[09/26/2023-14:40:55] [I] Latency: min = 45.8785 ms, max = 46.2848 ms, mean = 46.0246 ms, median = 46.0143 ms, percentile(90%) = 46.147 ms, percentile(95%) = 46.1866 ms, percentile(99%) = 46.2848 ms
[09/26/2023-14:40:55] [I] Enqueue Time: min = 2.44128 ms, max = 4.10165 ms, mean = 2.77255 ms, median = 2.69226 ms, percentile(90%) = 2.96558 ms, percentile(95%) = 3.55359 ms, percentile(99%) = 4.10165 ms
[09/26/2023-14:40:55] [I] H2D Latency: min = 0.278076 ms, max = 0.387558 ms, mean = 0.289538 ms, median = 0.283432 ms, percentile(90%) = 0.306519 ms, percentile(95%) = 0.31073 ms, percentile(99%) = 0.387558 ms
[09/26/2023-14:40:55] [I] GPU Compute Time: min = 45.5718 ms, max = 45.9738 ms, mean = 45.7089 ms, median = 45.6978 ms, percentile(90%) = 45.8298 ms, percentile(95%) = 45.8577 ms, percentile(99%) = 45.9738 ms
[09/26/2023-14:40:55] [I] D2H Latency: min = 0.0153809 ms, max = 0.0288086 ms, mean = 0.0261959 ms, median = 0.0263672 ms, percentile(90%) = 0.0280762 ms, percentile(95%) = 0.0283203 ms, percentile(99%) = 0.0288086 ms
[09/26/2023-14:40:55] [I] Total Host Walltime: 3.10278 s
[09/26/2023-14:40:55] [I] Total GPU Compute Time: 3.0625 s
[09/26/2023-14:40:55] [I] Explanations of the performance metrics are printed in the verbose logs.
The compute time is more than 45 ms, so the model is the bottleneck. YOLOv8 may be too heavy for Xavier NX; you need to optimize the model first.
There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.
If your platform gets similar trtexec performance, you may need to set “interval=4” in gst-nvinfer to skip inference on some frames and reach 60 FPS.
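To see why interval=4 is suggested, here is a quick back-of-the-envelope calculation based on the ~45.7 ms mean GPU compute time measured above (the 60 FPS source rate is an assumption from the question):

```python
# Rough throughput arithmetic from the trtexec numbers above.

gpu_compute_ms = 45.7   # mean GPU compute time per inference (trtexec summary)
source_fps = 60         # target display/stream rate

# Maximum inference rate the model can sustain on this board:
max_infer_fps = 1000.0 / gpu_compute_ms          # roughly 21.9 inferences/sec

# With interval=N, gst-nvinfer runs inference on 1 out of every N+1 frames:
interval = 4
required_infer_fps = source_fps / (interval + 1)  # 60 / 5 = 12 inferences/sec

print(f"model limit: {max_infer_fps:.1f} inf/s, needed: {required_infer_fps:.1f} inf/s")
# required_infer_fps < max_infer_fps, so 60 FPS becomes feasible with frame skipping
```

The skipped frames still flow through the pipeline; a tracker (gst-nvtracker) is typically used to carry detections across the non-inferred frames.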
system — Closed, October 17, 2023, 1:41am
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.