I’m running a real-time multi-camera monitoring system on a Jetson AGX Orin 64GB Developer Kit and experiencing a resource utilization bottleneck: adding more cameras drops the per-camera FPS, but resource usage does not increase proportionally.
System Configuration
Hardware:
Jetson AGX Orin 64GB Developer Kit
JetPack 6.2.1 (L4T 36.4.4)
12-core Cortex-A78AE CPU
Ampere GPU (2048 CUDA cores)
Software Stack:
Rust application using ONNX Runtime with TensorRT EP
CUDA execution provider with FP16 optimization
OpenCV for video decoding (HLS .m3u8 streams)
Multi-threaded architecture (877 threads for 67 cameras)
Why isn’t GPU utilization scaling beyond 65%? Is there a TensorRT stream/context bottleneck?
Are there CUDA stream configuration options in ONNX Runtime TensorRT EP that could help?
Is OpenCV VideoCapture single-threaded for HLS decoding? Could hardware NVDEC help (currently shows “off” in tegrastats)?
Memory bandwidth seems underutilized (23%) - is this a synchronization issue between CPU/GPU?
What I’ve Already Tried
✅ Verified CUDA/TensorRT are active (no CPU fallback)
✅ Multi-threaded architecture (13 threads per camera)
✅ Frame buffer = 1 (minimal latency)
✅ JPEG quality = 20 (low encoding overhead)
✅ All CPUs running at max frequency (2201 MHz)
What I’d Like to Achieve
Increase to 100+ cameras while maintaining 15+ FPS each
Better GPU utilization (target 85-95%)
Identify and remove the software bottleneck
Code Snippets
ONNX Session Creation:
// Simplified version
let cuda_provider = CUDAExecutionProvider::default()
    .with_device_id(0);
let tensorrt_provider = TensorRTExecutionProvider::default()
    .with_device_id(0)
    .with_fp16();
let session = Session::builder()?
    .with_execution_providers([tensorrt_provider, cuda_provider])?
    .with_model_from_file(model_path)?;
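For reference, here is the same builder sketched with the TensorRT EP engine-cache option and ONNX Runtime's own threading knob. The builder methods with_engine_cache, with_engine_cache_path and with_intra_threads are assumptions about the ort crate version in use (they map to ONNX Runtime's documented trt_engine_cache_enable / trt_engine_cache_path provider options and the session's intra-op thread count), so treat this as pseudocode rather than a drop-in change:
// Sketch only: assumed builder methods; exact names differ between ort versions.
let tensorrt_provider = TensorRTExecutionProvider::default()
    .with_device_id(0)
    .with_fp16()
    // Cache built engines on disk so each session in the pool does not rebuild its
    // TensorRT engine at startup (trt_engine_cache_enable / trt_engine_cache_path).
    .with_engine_cache(true)
    .with_engine_cache_path("/path/to/trt_cache");
let session = Session::builder()?
    // Keep ONNX Runtime's own thread pool small; with roughly 13 threads per camera
    // the 12 A78AE cores are already oversubscribed.
    .with_intra_threads(2)?
    .with_execution_providers([tensorrt_provider, cuda_provider])?
    .with_model_from_file(model_path)?;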
Inference Call:
// Multi-model parallel inference
let models_to_run = vec!["counting_model", "road_construction_model"];
let result = process_frame_with_inference(&frame, &model_pools, models_to_run).await?;
Any insights on optimizing multi-camera ONNX inference on Jetson would be greatly appreciated! Are there profiling tools (Nsight Systems?) or configuration tweaks specific to Jetson that could
help?
Thank you!
*** Please note that this reply is generated by LLM automatically ***
I’ll provide a comprehensive analysis of the issue and offer suggestions to improve performance.
Summary of the Problem
You’re running a real-time multi-camera monitoring system on a Jetson AGX Orin 64GB Developer Kit, experiencing a resource utilization bottleneck when adding more cameras. The system configuration includes a 12-core Cortex-A78AE CPU, Ampere GPU with 2048 CUDA cores, and a Rust application using ONNX Runtime with TensorRT EP. Despite having plenty of resources available, the FPS drops when adding new cameras, suggesting a software bottleneck.
Analysis
GPU Utilization: GPU utilization plateaus around 65% even with 67 cameras, which suggests the GPU itself is not the primary bottleneck.
CPU Usage: Aggregate CPU usage is only about 32.9%, so the 12 cores are far from fully utilized.
Memory Bandwidth: EMC (memory controller) utilization is only 23%, which indicates memory bandwidth is not the limiting factor.
Pipeline Architecture: The pipeline architecture is complex, involving multiple stages, including HLS stream decoding, frame preprocessing, ONNX Runtime with TensorRT EP, YOLO detection, NMS, SORT tracking, zone processing, and MongoDB storage.
Potential Bottlenecks
TensorRT Stream Context Bottleneck: It’s possible that the TensorRT stream contexts are not efficiently managed, leading to a bottleneck in the ONNX Runtime TensorRT EP. This could be due to the round-robin session pool or the MODEL_POOL_SIZE of 32 per model (a minimal sketch of such a pool follows this list).
Synchronization Issue between CPU and GPU: The GPU may be sitting idle while it waits for the CPU to finish decoding and preprocessing frames, which would cap GPU utilization well below 100%.
HLS Decoding: Software HLS decoding might be the bottleneck; tegrastats reporting NVDEC as “off” means every stream is currently decoded on the CPU by FFmpeg rather than by the hardware decoder.
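For reference, a round-robin session pool of the kind referred to above usually looks something like the sketch below. This is a minimal illustration rather than the application’s actual implementation; Session is the same ONNX Runtime session type shown in the question’s snippet.
use std::sync::atomic::{AtomicUsize, Ordering};

// Minimal round-robin pool: MODEL_POOL_SIZE sessions per model, handed out to
// caller threads in order. Checkout itself is cheap and lock-free; contention
// then happens inside ONNX Runtime / TensorRT on the single GPU.
pub struct SessionPool {
    sessions: Vec<Session>,
    next: AtomicUsize,
}

impl SessionPool {
    pub fn checkout(&self) -> &Session {
        let idx = self.next.fetch_add(1, Ordering::Relaxed) % self.sessions.len();
        &self.sessions[idx]
    }
}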
Suggestions
Optimize TensorRT Stream Context: Review how the TensorRT execution contexts and the session pool are managed, for example by using a more efficient pooling scheme or tuning MODEL_POOL_SIZE.
Use Hardware NVDEC: Offload HLS decoding to the hardware NVDEC engine (for example via a GStreamer pipeline), which should be far cheaper than software decoding on the CPU; see the sketch after this list.
Improve Synchronization between CPU and GPU: Reduce the time the CPU and GPU spend waiting on each other, for example by using CUDA events or streams, or by bounding the number of frames in flight at the inference stage (a sketch of the latter appears at the end of this reply).
Profile and Optimize the Pipeline: Use Nsight Systems and tegrastats to identify the most time-consuming stages, then optimize those stages first.
Increase GPU Utilization: Give the GPU more work per call, for example by batching frames from several cameras into a single inference, rather than issuing many small per-camera inferences.
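As a concrete illustration of the hardware NVDEC suggestion above, here is a minimal sketch of opening one HLS stream through a GStreamer pipeline so that decoding runs on the Jetson’s NVDEC block instead of FFmpeg on the CPU. It assumes OpenCV was built with GStreamer support and that the streams are H.264 carried in MPEG-TS segments; the element chain and caps may need adjusting for your streams, and open_hls_with_nvdec is just a hypothetical helper name.
use anyhow::Result;
use opencv::{prelude::*, videoio};

// Sketch: decode an HLS stream with NVDEC via GStreamer instead of FFmpeg on the CPU.
// Adjust hlsdemux/tsdemux/h264parse for your segment container and codec
// (e.g. qtdemux / h265parse).
pub fn open_hls_with_nvdec(url: &str) -> Result<videoio::VideoCapture> {
    let pipeline = format!(
        "souphttpsrc location={url} ! hlsdemux ! tsdemux ! h264parse ! \
         nvv4l2decoder ! nvvidconv ! video/x-raw,format=BGRx ! \
         videoconvert ! video/x-raw,format=BGR ! appsink drop=true max-buffers=1"
    );
    let cam = videoio::VideoCapture::from_file(&pipeline, videoio::CAP_GSTREAMER)?;
    if !cam.is_opened()? {
        anyhow::bail!("GStreamer pipeline failed to open for {url}");
    }
    Ok(cam)
}
If this path works, tegrastats should start reporting NVDEC activity. Note that the OPENCV_FFMPEG_CAPTURE_OPTIONS tuning only applies to the FFmpeg backend and has no effect on GStreamer captures.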
Comparison with Other Options
Jetson Xavier NX: Note that the Xavier NX is an older, less powerful module (Volta GPU with 384 CUDA cores) than the AGX Orin, so moving to it would not help here; the AGX Orin is already the stronger platform for this workload.
Other Deep Learning Accelerators: Dedicated accelerators such as Google’s Coral or Intel’s NCS2 might suit specific workloads, but they would require significant changes to the existing pipeline.
By addressing the potential bottlenecks and optimizing the pipeline, you should be able to increase the FPS and improve the overall performance of the system.
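Finally, a minimal sketch of the “bound the frames in flight” idea from the synchronization suggestion, so that camera tasks queue before the GPU stage instead of piling up inside ONNX Runtime. Frame, ModelPools, InferenceResult and process_frame_with_inference stand in for the types and function in your application, and MAX_INFLIGHT is a tuning knob rather than a measured value.
use std::sync::Arc;
use tokio::sync::Semaphore;

// Hypothetical bound on concurrent GPU work; share one semaphore across all
// camera tasks, e.g. Arc::new(Semaphore::new(MAX_INFLIGHT)).
const MAX_INFLIGHT: usize = 8;

async fn infer_bounded(
    gpu_slots: Arc<Semaphore>,
    frame: &Frame,
    model_pools: &ModelPools,
    models_to_run: Vec<&str>,
) -> anyhow::Result<InferenceResult> {
    // Camera tasks wait here instead of contending inside ONNX Runtime / TensorRT.
    let _permit = gpu_slots.acquire().await?;
    // _permit is dropped when this function returns, releasing the slot.
    process_frame_with_inference(frame, model_pools, models_to_run).await
}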
*** Please give the thumbs up if you get answers you like. Or provide feedback to help us improve the answer. ***
use anyhow::Result;
use opencv::{prelude::*, videoio};
use tracing::{instrument, warn};

static FFMPEG_LOG_SUPPRESSED: std::sync::Once = std::sync::Once::new();

pub fn setup_camera_properties(cam: &mut videoio::VideoCapture) -> Result<()> {
    // Keep the capture buffer as small as the backend allows, to minimise latency.
    let _ = cam.set(videoio::CAP_PROP_BUFFERSIZE, 0.0);
    // Request any available hardware acceleration. With the FFmpeg backend on Jetson
    // this typically does not engage NVDEC (tegrastats still reports NVDEC off).
    let _ = cam.set(
        videoio::CAP_PROP_HW_ACCELERATION,
        videoio::VIDEO_ACCELERATION_ANY as f64,
    );
    Ok(())
}

#[instrument(
    skip(url),
    fields(
        camera_id = %camera_id,
        url_length = url.len(),
        // Declared up front so the Span::current().record(...) calls below take
        // effect; recording a field that was not declared is silently ignored.
        connection_attempts = tracing::field::Empty,
        connection_duration_secs = tracing::field::Empty,
        connection_success = tracing::field::Empty,
    )
)]
pub async fn connect_camera(url: &str, camera_id: &str) -> Result<videoio::VideoCapture> {
    let start = std::time::Instant::now();
    let max_connection_attempts = 3;

    // FFmpeg/OpenCV capture options are read from the environment when a capture is
    // created, so set them once, process-wide, before the first connection attempt.
    FFMPEG_LOG_SUPPRESSED.call_once(|| {
        unsafe {
            std::env::set_var(
                "OPENCV_FFMPEG_CAPTURE_OPTIONS",
                "fflags;nobuffer+fastseek+flush_packets|flags;low_delay|analyzeduration;100000|probesize;100000|max_delay;0|sync;ext",
            );
            std::env::set_var("OPENCV_LOG_LEVEL", "ERROR");
            std::env::set_var("AV_LOG_FORCE_LEVEL", "32");
            std::env::set_var("FFREPORT", "level=-8");
        }
    });

    for connection_attempts in 0..max_connection_attempts {
        if let Ok(mut new_cam) = videoio::VideoCapture::from_file(url, videoio::CAP_FFMPEG) {
            if new_cam.is_opened().unwrap_or(false) && setup_camera_properties(&mut new_cam).is_ok()
            {
                let duration = start.elapsed().as_secs_f64();
                tracing::Span::current().record("connection_attempts", connection_attempts + 1);
                tracing::Span::current().record("connection_duration_secs", duration);
                tracing::Span::current().record("connection_success", true);
                return Ok(new_cam);
            }
        }
        if connection_attempts < max_connection_attempts - 1 {
            let delay = std::time::Duration::from_millis(500);
            warn!(
                "Camera {} connection attempt {} failed, retrying in {:?}...",
                camera_id,
                connection_attempts + 1,
                delay
            );
            tokio::time::sleep(delay).await;
        }
    }

    let duration = start.elapsed().as_secs_f64();
    tracing::Span::current().record("connection_attempts", max_connection_attempts);
    tracing::Span::current().record("connection_duration_secs", duration);
    tracing::Span::current().record("connection_success", false);
    Err(anyhow::anyhow!(
        "Failed to connect to camera after {} attempts",
        max_connection_attempts
    ))
}
No, I don’t use the GStreamer SDK.
I tried using a larger model. Although it pushed the GPU to maximum utilization, I still saw this error when reading 69 cameras. (I don’t see it if I only use 4 cameras.)
Based on the error, it looks like you are hitting a decoder limitation.
Have you tried turning off inference to see whether frames still drop at 68+ cameras?