Hello, folks!
I’m experiencing performance issues with YOLOv8 on DeepStream 6.2. I’m using the default yolov8s.pt file and the generated cfg, wts, and labels.txt files (source: https://github.com/ultralytics/assets/releases/download/v0.0.0/yolov8s.pt ), and performing inference on the sample_1080p_h264.mp4 video (path: /opt/nvidia/deepstream/deepstream/samples/streams/). The reported performance is around 25 FPS on a 1920x1080 monitor running at 60 Hz. How can I increase this frame rate?
System Specifications:
Jetson Xavier NX
Volta GPU 384-core NVIDIA with 48 Tensor Cores
Ubuntu 20.04 LTS
CUDA 11.4.315
DeepStream 6.2
JetPack 5.1
PyTorch 1.12.0
Torchvision 0.13.0
TensorRT 8.5.2.2
Output:
**PERF: FPS 0 (Avg)
**PERF: 0.00 (0.00)
** INFO: <bus_callback:239>: Pipeline ready
Opening in BLOCKING MODE
NvMMLiteOpen : Block : BlockType = 261
NVMEDIA: Reading vendor.tegra.display-size : status: 6
NvMMLiteBlockCreate : Block : BlockType = 261
** INFO: <bus_callback:225>: Pipeline running
**PERF: 34.75 (31.86)
**PERF: 25.09 (27.03)
**PERF: 24.75 (26.27)
**PERF: 25.57 (24.91)
nvstreammux: Successfully handled EOS for source_id=0
**PERF: 25.53 (25.97)
**PERF: 25.68 (25.57)
**PERF: 25.52 (25.73)
**PERF: 25.61 (25.76)
**PERF: 25.63 (25.55)
**PERF: 25.59 (25.47)
**PERF: 25.52 (25.41)
** INFO: <bus_callback:262>: Received EOS. Exiting …
File deepstream_app_config.txt:
[application]
enable-perf-measurement=1
perf-measurement-interval-sec=5
[tiled-display]
enable=1
rows=1
columns=1
width=1280
height=720
gpu-id=0
nvbuf-memory-type=0
[source0]
enable=1
type=3
uri=file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_1080p_h264.mp4
num-sources=1
num-extra-surfaces=24
gpu-id=0
cudadec-memtype=0
[sink0]
enable=1
type=2
sync=0
gpu-id=0
nvbuf-memory-type=0
[osd]
enable=1
gpu-id=0
border-width=5
text-size=15
text-color=1;1;1;1;
text-bg-color=0.3;0.3;0.3;1
font=Serif
show-clock=0
clock-x-offset=800
clock-y-offset=820
clock-text-size=12
clock-color=1;0;0;0
nvbuf-memory-type=0
[streammux]
gpu-id=0
live-source=0
buffer-pool-size=1000
batch-size=1000
batched-push-timeout=100000
width=1280
height=720
enable-padding=0
nvbuf-memory-type=0
[primary-gie]
enable=1
gpu-id=0
batch-size=1
gie-unique-id=1
nvbuf-memory-type=0
config-file=config_infer_primary_yoloV8.txt
[tests]
file-loop=0
Thank you!
yingliu — September 18, 2023, 1:31pm
Please also share your PGIE config file, thanks.
Okay, here is the PGIE config file:
File config_infer_primary_yoloV8.txt:
[property]
gpu-id=0
net-scale-factor=0.0039215697906911373
model-color-format=0
custom-network-config=yolov8s.cfg
model-file=yolov8s.wts
model-engine-file=model_b1_gpu0_fp32.engine
#int8-calib-file=calib.table
labelfile-path=labels.txt
batch-size=10 #=1 default
network-mode=0
num-detected-classes=80
interval=0
gie-unique-id=1
process-mode=2
network-type=0
cluster-mode=2
maintain-aspect-ratio=1
symmetric-padding=1
parse-bbox-func-name=NvDsInferParseYolo
custom-lib-path=nvdsinfer_custom_impl_Yolo/libnvdsinfer_custom_impl_Yolo.so
engine-create-func-name=NvDsInferYoloCudaEngineGet
[class-attrs-all]
nms-iou-threshold=0.45
pre-cluster-threshold=0.25
topk=300
Why did you set the nvstreammux batch size to 1000? See: Frequently Asked Questions — DeepStream 6.3 Release documentation
batched-push-timeout should be 1/framerate (the property is specified in microseconds).
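Putting the two suggestions above together, the [streammux] group for a single ~30 FPS file source could look like this (a sketch; the pool size of 4 is an assumption, and 33333 µs corresponds to 1/30 s):

```
[streammux]
gpu-id=0
live-source=0
# batch-size should match the number of sources (1 here), not 1000
batch-size=1
# a small buffer pool is enough for a single source
buffer-pool-size=4
# 1/framerate in microseconds (~33 ms for a 30 FPS source)
batched-push-timeout=33333
width=1280
height=720
enable-padding=0
nvbuf-memory-type=0
```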
Please measure the model performance of the “model_b1_gpu0_fp32.engine” by the “trtexec” tool.
Hi, @Fiona.Chen . I’m sorry for the delay. I tried to measure the performance using the command:
/usr/src/tensorrt/bin/trtexec --loadEngine=model_b1_gpu0_fp32.engine
but it produced the following errors:
[09/25/2023-09:29:01] [E] Error[1]:[pluginV2Runner.cpp::load::299] Error Code 1: Serialization (Serialization assertion creator failed.Cannot deserialize plugin since corresponding IPluginCreator not found in Plugin Registry)
[09/25/2023-09:29:01] [E] Error[4]: [runtime.cpp::deserializeCudaEngine::65] Error Code 4: Internal Error (Engine deserialization failed.)
[09/25/2023-09:29:01] [E] Engine deserialization failed
[09/25/2023-09:29:01] [E] Got invalid engine!
[09/25/2023-09:29:01] [E] Inference set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --loadEngine=model_b1_gpu0_fp32.engine
How should I proceed?
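The “IPluginCreator not found” error above usually means trtexec cannot deserialize the custom YOLO layer compiled into the engine. One way around it (a sketch, assuming the plugin library path from the PGIE config) is to point trtexec at the custom library with --plugins:

```
/usr/src/tensorrt/bin/trtexec \
    --loadEngine=model_b1_gpu0_fp32.engine \
    --plugins=nvdsinfer_custom_impl_Yolo/libnvdsinfer_custom_impl_Yolo.so
```

With the plugin creator registered, deserialization should succeed and trtexec can report the latency/throughput summary.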
How and where did you generate the engine file?
I’ve measured the model engine performance by trtexec in my Xavier NX board with max power.
[09/26/2023-14:40:55] [I] === Performance summary ===
[09/26/2023-14:40:55] [I] Throughput: 21.5936 qps
[09/26/2023-14:40:55] [I] Latency: min = 45.8785 ms, max = 46.2848 ms, mean = 46.0246 ms, median = 46.0143 ms, percentile(90%) = 46.147 ms, percentile(95%) = 46.1866 ms, percentile(99%) = 46.2848 ms
[09/26/2023-14:40:55] [I] Enqueue Time: min = 2.44128 ms, max = 4.10165 ms, mean = 2.77255 ms, median = 2.69226 ms, percentile(90%) = 2.96558 ms, percentile(95%) = 3.55359 ms, percentile(99%) = 4.10165 ms
[09/26/2023-14:40:55] [I] H2D Latency: min = 0.278076 ms, max = 0.387558 ms, mean = 0.289538 ms, median = 0.283432 ms, percentile(90%) = 0.306519 ms, percentile(95%) = 0.31073 ms, percentile(99%) = 0.387558 ms
[09/26/2023-14:40:55] [I] GPU Compute Time: min = 45.5718 ms, max = 45.9738 ms, mean = 45.7089 ms, median = 45.6978 ms, percentile(90%) = 45.8298 ms, percentile(95%) = 45.8577 ms, percentile(99%) = 45.9738 ms
[09/26/2023-14:40:55] [I] D2H Latency: min = 0.0153809 ms, max = 0.0288086 ms, mean = 0.0261959 ms, median = 0.0263672 ms, percentile(90%) = 0.0280762 ms, percentile(95%) = 0.0283203 ms, percentile(99%) = 0.0288086 ms
[09/26/2023-14:40:55] [I] Total Host Walltime: 3.10278 s
[09/26/2023-14:40:55] [I] Total GPU Compute Time: 3.0625 s
[09/26/2023-14:40:55] [I] Explanations of the performance metrics are printed in the verbose logs.
The compute time is more than 45 ms, so the model is the bottleneck. YOLOv8 may be too heavy for Xavier NX; you need to optimize the model first.
There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.
If your platform gets similar trtexec performance, you may need to set “interval=4” in gst-nvinfer to skip inference on some frames and reach 60 FPS.
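To see why interval=4 is suggested, here is a quick back-of-the-envelope calculation based on the ~45.7 ms mean GPU compute time measured above (the 60 FPS source rate is an assumption from the question):

```python
# Rough throughput arithmetic from the trtexec numbers above.

gpu_compute_ms = 45.7   # mean GPU compute time per inference (trtexec summary)
source_fps = 60         # target display/stream rate

# Maximum inference rate the model can sustain on this board:
max_infer_fps = 1000.0 / gpu_compute_ms          # roughly 21.9 inferences/sec

# With interval=N, gst-nvinfer runs inference on 1 out of every N+1 frames:
interval = 4
required_infer_fps = source_fps / (interval + 1)  # 60 / 5 = 12 inferences/sec

print(f"model limit: {max_infer_fps:.1f} inf/s, needed: {required_infer_fps:.1f} inf/s")
# required_infer_fps < max_infer_fps, so 60 FPS becomes feasible with frame skipping
```

The skipped frames still flow through the pipeline; a tracker (gst-nvtracker) is typically used to carry detections across the non-inferred frames.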
system — Closed, October 17, 2023, 1:41am
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.