Hello,
I am trying to understand why there are always 4 GstNvInferServer processes running in parallel when using DeepStream on AGX Orin.
I see the same behaviour with the GstNvInfer plugin.
Here is the .nsys-rep file of one trial.
deepstream_profile_2_instances.zip (6.4 MB)
Whatever value I set in config.pbtxt, it doesn't change anything:
instance_group [
{
kind: KIND_GPU
count: 1
}
]
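For reference, here is roughly where that block sits in the model's config.pbtxt (a minimal sketch with placeholder name, platform and batch size, not my actual file):
name: "my_model"            # placeholder model name
platform: "tensorrt_plan"   # placeholder backend
max_batch_size: 4           # placeholder
instance_group [
  {
    kind: KIND_GPU
    count: 1
  }
]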
Thank you
yuweiw
June 12, 2025, 2:09am
Could you attach your whole pipeline and the version of DeepStream?
As I can't share my own pipeline because of an NDA, I tried to reproduce the same behaviour using the deepstream_python_apps repo.
I used this example: deepstream_python_apps/apps/deepstream-test1-rtsp-out/deepstream_test1_rtsp_out.py at master · NVIDIA-AI-IOT/deepstream_python_apps · GitHub
and modified it as follows:
...
pgie = Gst.ElementFactory.make("nvinferserver", "primary-inference")
...
pgie.set_property('config-file-path', "pgie_config_triton_grpc.txt")
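For context, a minimal sketch of the swap as I made it (the error check mirrors the sample's style; the config file name is my own):
# Gst and sys are already imported by the sample
pgie = Gst.ElementFactory.make("nvinferserver", "primary-inference")
if not pgie:
    sys.stderr.write(" Unable to create nvinferserver \n")
# Point the element at the Triton gRPC config instead of the nvinfer config
pgie.set_property('config-file-path', "pgie_config_triton_grpc.txt")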
DeepStream 7.1
Here is the nvinferserver config file:
pgie_config_triton_grpc.txt (879 Bytes)
If I launch the command: nsys profile --trace=cuda,nvtx,osrt --output=deepstream_profile python deepstream_test_1_rtsp_out.py -i /opt/nvidia/deepstream/deepstream/samples/streams/sample_720p.h264
I get these results:
deepstream_profile_test_1_rtsp_out_triton.zip (8.3 MB)
You can see the same behaviour here.
Thanks for your help.
yuweiw
June 17, 2025, 8:59am
That's the timeline view. In fact, there is only one thread. You can check that with Show In Events View; they all belong to one TID.
You can also check our source code below.
sources\gst-plugins\gst-nvinferserver\gstnvinferserver.cpp
static GstFlowReturn gst_nvinfer_server_submit_input_buffer(
GstBaseTransform* btrans, gboolean discont, GstBuffer* inbuf) {
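To illustrate the idea in generic terms (this is just a sketch of the pattern, not DeepStream code): a single thread submits batches without waiting for each one to finish, so several batches can be in flight at once even though there is only one submitting thread.
import concurrent.futures
import time

def infer(batch_id):
    time.sleep(0.01)  # stand-in for asynchronous GPU/Triton inference
    return batch_id

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    # One submitting thread (this one) queues batches without blocking on results,
    # so up to 4 batches overlap on a timeline, similar to what the nsys view shows.
    futures = [pool.submit(infer, i) for i in range(16)]
    for f in concurrent.futures.as_completed(futures):
        print(f"batch {f.result()} done")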
I agree with you, but why are 4 batches handled in parallel while sometimes there is only one?
If you take the previous example (deepstream_test1_rtsp_out.py), what is the bottleneck, based on the nsys profile analysis?
Thank you
yuweiw
June 18, 2025, 3:20am
What bottleneck do you need to solve in your scenario? You can also get the latency of each plugin to find the bottleneck. Please refer to our Enable Latency measurement for deepstream sample apps.
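For the C sample apps, latency measurement is enabled through environment variables along these lines (please check that doc page for the exact steps on your DeepStream version):
export NVDS_ENABLE_LATENCY_MEASUREMENT=1
export NVDS_ENABLE_COMPONENT_LATENCY_MEASUREMENT=1   # per-plugin latency
deepstream-app -c <your_config.txt>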
I have run some tests on my side, and I would like to understand what is causing the following drop in performance.
Here is my setup:
AGX Orin 1 (DeepStream pipeline - example deepstream_test_1_rtsp_out.py + Triton server 25.05)
AGX Orin 2 (Triton server 25.05)
10 Gbit/s switch in between
I run the example using a 120 fps 1920x1080 H.264 video.
TEST 1 - Only AGX Orin 1 (pipeline + Triton on the same device)
The pipeline runs fine and the actual FPS is 120 fps.
Here is the nsys file:
deepstream_test_1_rtsp_out_120fps_local.zip (7.8 MB)
TEST 2 - AGX Orin 1 (pipeline) + AGX Orin 2 (Triton)
It is exactly the same config as TEST 1, except for the Triton server URL in the nvinferserver config file (see the config sketch after this test).
The pipeline FPS drops to around 93 fps.
Here is the nsys file:
deepstream_test_1_rtsp_out_120fps_remote.zip (8.3 MB)
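For clarity, the only part of pgie_config_triton_grpc.txt that changes between TEST 1 and TEST 2 is the Triton endpoint; a trimmed sketch of that section (other fields omitted):
infer_config {
  backend {
    triton {
      model_name: "trafficcamnet"
      grpc {
        url: "192.168.121.2:8001"   # AGX Orin 2; for TEST 1 this points to the local server
      }
    }
  }
}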
TEST 3 - AGX Orin 1 (perf_analyzer) + AGX Orin 2 (Triton)
Here is the command:
perf_analyzer -m trafficcamnet -i grpc -u 192.168.121.2:8001 --concurrency-range 1:8 -b 1
and the results:
...
Request concurrency: 3
Client:
Request count: 2575
Throughput: 119.757 infer/sec
Avg latency: 21630 usec (standard deviation 3304 usec)
p50 latency: 21279 usec
p90 latency: 25983 usec
p95 latency: 27480 usec
p99 latency: 30475 usec
Avg gRPC time: 21591 usec ((un)marshal request/response 715 usec + response wait 20876 usec)
Server:
Inference count: 2574
Execution count: 2574
Successful request count: 2574
Avg request latency: 7235 usec (overhead 398 usec + queue 98 usec + compute input 1080 usec + compute infer 5030 usec + compute output 628 usec)
...
As you can see, perf_analyzer can reach 120 fps with the remote Triton server, which would suggest this is not a network issue.
How, then, do you explain the FPS drop between the local and remote Triton server, given the successful perf_analyzer results?
Thank you.
yuweiw
June 24, 2025, 3:13am
Could you please do the following tests separately to help analyze this issue?
Set the instance_group count to 3 and try again.
Use perf_analyzer to get the perf data in the Only AGX Orin 1 case.
Try setting buffer-pool-size to 8 for the nvstreammux plugin and check whether that improves the perf (see the snippet below).
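For the third point, a minimal sketch of what that change would look like in the Python app (only the buffer-pool-size line is new; the sample's other nvstreammux properties stay as they are):
streammux = Gst.ElementFactory.make("nvstreammux", "Stream-muxer")
# New: enlarge the muxer's output buffer pool
streammux.set_property("buffer-pool-size", 8)
# ... keep the sample's existing width/height/batch-size settings ...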
The instance_group count was already set to 5, but yes, it has an impact if I reduce it to 1 or 2.
In my case, I think jetson_clocks was not always enabled between tests (after reboots), and that was causing the drop in performance.
I have made a service to set the clocks automatically at boot. Since then, performance has been much more consistent.
yuweiw
July 15, 2025, 9:25am
Glad to hear that. If possible, could you share how you made a service to set them automatically at boot, so that others can refer to it? Thanks.
Enable jetson_clocks:
sudo /usr/bin/jetson_clocks
To run this automatically at boot, you can create a systemd service:
sudo nano /etc/systemd/system/jetson_clocks.service
[Unit]
Description=Jetson Clocks Service
After=multi-user.target
[Service]
Type=oneshot
ExecStart=/usr/bin/jetson_clocks
RemainAfterExit=true
[Install]
WantedBy=multi-user.target
Then reload systemd and enable the service:
sudo systemctl daemon-reload
sudo systemctl enable jetson_clocks.service
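To verify without waiting for a reboot:
sudo systemctl start jetson_clocks.service
sudo jetson_clocks --show   # prints the current clock configuration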