How to optimize Rust multi-camera ONNX inference pipeline to utilize more GPU resources on Jetson AGX Orin 64GB?

I’m running a real-time multi-camera monitoring system on Jetson AGX Orin 64GB Developer Kit and experiencing a resource utilization bottleneck. Adding more cameras drops FPS but doesn’t increase
resource usage proportionally.

System Configuration

Hardware:

  • Jetson AGX Orin 64GB Developer Kit
  • JetPack 6.2.1 (L4T 36.4.4)
  • 12-core Cortex-A78AE CPU
  • Ampere GPU (2048 CUDA cores)

Software Stack:

  • Rust application using ONNX Runtime with TensorRT EP
  • CUDA execution provider with FP16 optimization
  • OpenCV for video decoding (HLS .m3u8 streams)
  • Multi-threaded architecture (877 threads for 67 cameras)

Current Performance

With 67 Cameras @ 10 FPS each (670 FPS total):

  • GPU Utilization: 63-65% (GR3D_FREQ @ 1287-1300 MHz)
  • GPU Memory: 2.66 GB / ~60 GB available
  • CPU Usage: 329% of 1200% (27% total utilization)
  • RAM: 8.1 GB / 62.8 GB (13%)
  • EMC (Memory Controller): 23% @ 3199 MHz
  • Power: 19.3W GPU + 7.8W CPU = ~34W total
  • Temperatures: GPU 62°C, CPU 65°C

Per Camera:

  • ~0.96% GPU utilization
  • ~40 MB GPU memory
  • ~4.9% CPU per core
  • ~124 MB RAM

Problem Statement

When I add new cameras (camera 68+), the FPS drops across all cameras but resource utilization remains nearly the same:

  • GPU still at 64-65% (not saturating)
  • CPU still at ~30-40%
  • Plenty of RAM available
  • Memory bandwidth (EMC) only at 23%

Expected: Resources should scale up with more cameras until hitting hardware limits.

Actual: FPS degrades while resources plateau, suggesting a software bottleneck rather than hardware limitation.

Pipeline Architecture

HLS Stream → OpenCV VideoCapture → Frame Preprocessing (resize 640x640, BGR→RGB)
→ ONNX Runtime (TensorRT EP) → YOLO Detection → NMS
→ SORT Tracking → Zone Processing → MongoDB Storage + WebSocket Streaming
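
For context, the preprocessing step is roughly the following (a simplified sketch, not the exact production code); it assumes the usual YOLO convention of a CHW f32 input normalized to [0, 1]:

// Illustrative preprocessing: resize to 640x640, then repack the interleaved BGR
// bytes into an RGB CHW f32 tensor normalized to [0, 1].
use opencv::{core::{Mat, Size}, imgproc, prelude::*};

fn preprocess_to_chw(frame: &Mat) -> opencv::Result<Vec<f32>> {
    let mut resized = Mat::default();
    imgproc::resize(frame, &mut resized, Size::new(640, 640), 0.0, 0.0, imgproc::INTER_LINEAR)?;

    let bytes = resized.data_bytes()?; // continuous HWC, BGR, u8
    let (h, w) = (640usize, 640usize);
    let mut chw = vec![0f32; 3 * h * w];
    for y in 0..h {
        for x in 0..w {
            let p = (y * w + x) * 3;
            // Swap BGR -> RGB while splitting the interleaved pixels into planes.
            chw[y * w + x] = bytes[p + 2] as f32 / 255.0;             // R plane
            chw[h * w + y * w + x] = bytes[p + 1] as f32 / 255.0;     // G plane
            chw[2 * h * w + y * w + x] = bytes[p] as f32 / 255.0;     // B plane
        }
    }
    Ok(chw)
}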

Model Pool Configuration:

  • Multiple ONNX models loaded with TensorRT FP16
  • Round-robin session pool (MODEL_POOL_SIZE=32 per model); see the sketch after this list
  • Models: counting_model.onnx, road_construction_model.onnx
  • Parallel multi-model inference per frame
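
For context, this is roughly what the round-robin pool looks like (a simplified sketch; the real field and method names differ, and S stands in for an ort Session):

// Simplified round-robin session pool: callers grab the next session in turn.
// Each pooled session keeps its own TensorRT engine/execution context.
use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};

pub struct RoundRobinPool<S> {
    sessions: Vec<Arc<S>>,
    next: AtomicUsize,
}

impl<S> RoundRobinPool<S> {
    pub fn new(sessions: Vec<Arc<S>>) -> Self {
        Self { sessions, next: AtomicUsize::new(0) }
    }

    pub fn acquire(&self) -> Arc<S> {
        let i = self.next.fetch_add(1, Ordering::Relaxed) % self.sessions.len();
        Arc::clone(&self.sessions[i])
    }
}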

Processing Details:

  • Each camera runs in a separate async loop (see the sketch after this list)
  • 1 frame buffer per camera
  • CUDA execution provider enabled
  • TensorRT graph optimization enabled
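
The per-camera loop with its 1-frame buffer is conceptually like this (a simplified sketch with placeholder names, not the real code):

// Conceptual per-camera loop: a tokio watch channel holds only the latest frame,
// so a slow consumer never builds up a queue. Frame and run_camera are placeholders.
use tokio::sync::watch;

type Frame = Vec<u8>; // placeholder for a decoded frame

async fn run_camera(camera_id: String, latest: watch::Sender<Option<Frame>>) {
    loop {
        // Decode on the blocking pool so a slow read doesn't stall other cameras
        // sharing the async runtime.
        let frame = tokio::task::spawn_blocking(|| -> Option<Frame> {
            // ... grab and decode one frame here ...
            None
        })
        .await
        .ok()
        .flatten();

        if let Some(f) = frame {
            // send() replaces the previous value, so the buffer depth stays at 1.
            let _ = latest.send(Some(f));
        } else {
            eprintln!("camera {camera_id}: no frame");
        }
    }
}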

Questions

  1. Why isn’t GPU utilization scaling beyond 65%? Is there a TensorRT stream/context bottleneck?

  2. Are there CUDA stream configuration options in ONNX Runtime TensorRT EP that could help?

  3. Is OpenCV VideoCapture single-threaded for HLS decoding? Could hardware NVDEC help (currently shows “off” in tegrastats)?

  4. Memory bandwidth seems underutilized (23%) - is this a synchronization issue between CPU/GPU?

What I’ve Already Tried

  • ✅ Verified CUDA/TensorRT are active (no CPU fallback)
  • ✅ Multi-threaded architecture (13 threads per camera)
  • ✅ Frame buffer = 1 (minimal latency)
  • ✅ JPEG quality = 20 (low encoding overhead)
  • ✅ All CPUs running at max frequency (2201 MHz)

What I’d Like to Achieve

  • Increase to 100+ cameras while maintaining 15+ FPS each
  • Better GPU utilization (target 85-95%)
  • Identify and remove the software bottleneck

Code Snippets

ONNX Session Creation:

// Simplified version
let cuda_provider = CUDAExecutionProvider::default()
    .with_device_id(0);
let tensorrt_provider = TensorRTExecutionProvider::default()
    .with_device_id(0)
    .with_fp16();

let session = Session::builder()?
    .with_execution_providers([tensorrt_provider, cuda_provider])?
    .with_model_from_file(model_path)?;

Inference Call:

// Multi-model parallel inference
let models_to_run = vec!["counting_model", "road_construction_model"];
let result = process_frame_with_inference(&frame, &model_pools, models_to_run).await?;

Any insights on optimizing multi-camera ONNX inference on Jetson would be greatly appreciated! Are there profiling tools (Nsight Systems?) or configuration tweaks specific to Jetson that could
help?

Thank you!



*** Please note that this reply is generated by LLM automatically ***

I’ll provide a comprehensive analysis of the issue and offer suggestions to improve performance.

Summary of the Problem

You’re running a real-time multi-camera monitoring system on a Jetson AGX Orin 64GB Developer Kit, experiencing a resource utilization bottleneck when adding more cameras. The system configuration includes a 12-core Cortex-A78AE CPU, Ampere GPU with 2048 CUDA cores, and a Rust application using ONNX Runtime with TensorRT EP. Despite having plenty of resources available, the FPS drops when adding new cameras, suggesting a software bottleneck.

Analysis

  1. GPU Utilization: The GPU utilization is not saturating, even with 67 cameras, which suggests that the GPU is not the primary bottleneck.
  2. CPU Usage: The CPU usage is relatively low, at roughly 27% of total capacity (329% of the 1200% available across 12 cores), indicating that the CPU is not fully utilized.
  3. Memory Bandwidth: The EMC Memory Controller is only at 23%, which is relatively low, indicating that memory bandwidth is not a major concern.
  4. Pipeline Architecture: The pipeline architecture is complex, involving multiple stages, including HLS stream decoding, frame preprocessing, ONNX Runtime with TensorRT EP, YOLO detection, NMS, SORT tracking, zone processing, and MongoDB storage.

Potential Bottlenecks

  1. TensorRT Stream Context Bottleneck: It’s possible that the TensorRT stream context is not efficiently managed, leading to a bottleneck in the ONNX Runtime TensorRT EP. This could be due to the round-robin session pool or the MODEL_POOL_SIZE=32 sessions per model (see the sketch after this list).
  2. Synchronization Issue between CPU and GPU: There might be a synchronization issue between the CPU and GPU, causing the GPU to wait for the CPU to finish processing, leading to a bottleneck.
  3. HLS Decoding: HLS decoding might be a bottleneck, especially if the decoding is not optimized for the Jetson AGX Orin’s hardware.
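
If the pool is the issue, one concrete thing to check is whether the TensorRT engine cache is enabled; without it, each of the 32 sessions per model builds its own engine at startup. The sketch below is hedged: the builder method names are assumptions about how your ort version exposes the ONNX Runtime options trt_engine_cache_enable and trt_engine_cache_path, and the cache path is hypothetical.

// Assumed ort wrappers for the TensorRT EP engine-cache options (verify against your
// ort version); with caching enabled, pooled sessions can deserialize one prebuilt
// engine instead of each rebuilding it at startup.
let tensorrt_provider = TensorRTExecutionProvider::default()
    .with_device_id(0)
    .with_fp16()
    .with_engine_cache(true)
    .with_engine_cache_path("/opt/app/trt_cache");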

Suggestions

  1. Optimize TensorRT Stream Context: Investigate optimizing the TensorRT stream context management, such as using a more efficient session pool or increasing MODEL_POOL_SIZE.
  2. Use Hardware NVDEC: Consider using the hardware NVDEC decoder for HLS decoding, which might be more efficient than software-based decoding (see the sketch after this list).
  3. Improve Synchronization between CPU and GPU: Investigate improving the synchronization between the CPU and GPU, such as using CUDA events or streams to reduce waiting times.
  4. Profile and Optimize the Pipeline: Profile the pipeline to identify the most time-consuming stages and optimize those stages for better performance.
  5. Increase GPU Utilization: Consider increasing the GPU utilization by adding more parallelism or using more GPU-intensive models.
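
On the NVDEC point: one way to reach the hardware decoder from OpenCV is a GStreamer pipeline instead of the FFmpeg backend. The sketch below is an assumption-laden example, not a drop-in replacement: it assumes H.264 video inside MPEG-TS HLS segments, an OpenCV build with GStreamer support, and URLs that don’t need extra quoting.

// Hedged example: open an HLS stream through GStreamer so decoding runs on NVDEC
// (nvv4l2decoder) instead of the CPU. The elements are the standard Jetson ones,
// but the exact pipeline may need tuning for your streams.
use opencv::videoio;

fn open_hls_with_nvdec(url: &str) -> opencv::Result<videoio::VideoCapture> {
    let pipeline = format!(
        "souphttpsrc location={url} ! hlsdemux ! tsdemux ! h264parse ! \
         nvv4l2decoder ! nvvidconv ! video/x-raw,format=BGRx ! \
         videoconvert ! video/x-raw,format=BGR ! appsink drop=true max-buffers=1"
    );
    videoio::VideoCapture::from_file(&pipeline, videoio::CAP_GSTREAMER)
}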

Comparison with Other Options

  1. Jetson Xavier NX: The Jetson Xavier NX runs the same software stack but has a considerably less powerful GPU (384 Volta CUDA cores versus 2048 Ampere cores on the AGX Orin), so switching boards would not help; since your GPU is not saturated anyway, different hardware is unlikely to be the answer.
  2. Other Deep Learning Accelerators: Other deep learning accelerators, such as Google’s Coral or Intel’s NCS2, might provide better performance for specific workloads. However, these accelerators might require significant changes to the existing pipeline.

By addressing the potential bottlenecks and optimizing the pipeline, you should be able to increase the FPS and improve the overall performance of the system.


Hi,

Could you share how you read the camera with OpenCV?
Do you use the GStreamer SDK?

Please also try to remove the inference part to see if the limitation comes from the camera or web streaming itself.
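
For example, a capture-only loop like the sketch below (just an illustration, not your code) would show whether decoding alone already falls below 10 FPS per stream once you go past 67 cameras:

// Illustrative decode-only benchmark: read frames with no inference and report the
// achieved FPS for one camera over a fixed window.
use std::time::Instant;
use opencv::{core::Mat, prelude::*, videoio};

pub fn capture_only_benchmark(url: &str, camera_id: &str, seconds: u64) -> opencv::Result<f64> {
    let mut cam = videoio::VideoCapture::from_file(url, videoio::CAP_FFMPEG)?;
    let mut frame = Mat::default();
    let mut frames = 0u64;
    let start = Instant::now();

    while start.elapsed().as_secs() < seconds {
        // read() returns false if no frame could be grabbed/decoded.
        if cam.read(&mut frame)? {
            frames += 1;
        }
    }

    let fps = frames as f64 / seconds as f64;
    println!("camera {camera_id}: {fps:.1} FPS, decode only");
    Ok(fps)
}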

Thanks.

This is how I read the cameras with OpenCV:

use anyhow::Result;
use opencv::{prelude::*, videoio};
use tracing::{instrument, warn};

static FFMPEG_LOG_SUPPRESSED: std::sync::Once = std::sync::Once::new();

pub fn setup_camera_properties(cam: &mut videoio::VideoCapture) -> Result<()> {
    // Keep the capture buffer minimal so reads return the freshest frame.
    let _ = cam.set(videoio::CAP_PROP_BUFFERSIZE, 0.0);
    // Ask OpenCV/FFmpeg for any available hardware acceleration.
    let _ = cam.set(
        videoio::CAP_PROP_HW_ACCELERATION,
        videoio::VIDEO_ACCELERATION_ANY as f64,
    );

    Ok(())
}

#[instrument(skip(url), fields(camera_id = %camera_id, url_length = url.len()))]
pub async fn connect_camera(url: &str, camera_id: &str) -> Result<videoio::VideoCapture> {
    let start = std::time::Instant::now();
    let max_connection_attempts = 3;

    FFMPEG_LOG_SUPPRESSED.call_once(|| {
        unsafe {
            std::env::set_var(
                "OPENCV_FFMPEG_CAPTURE_OPTIONS",
                "fflags;nobuffer+fastseek+flush_packets|flags;low_delay|analyzeduration;100000|probesize;100000|max_delay;0|sync;ext",
            );
            std::env::set_var("OPENCV_LOG_LEVEL", "ERROR");
            std::env::set_var("AV_LOG_FORCE_LEVEL", "32");
            std::env::set_var("FFREPORT", "level=-8");
        }
    });

    for connection_attempts in 0..max_connection_attempts {
        if let Ok(mut new_cam) = videoio::VideoCapture::from_file(url, videoio::CAP_FFMPEG) {
            if new_cam.is_opened().unwrap_or(false) && setup_camera_properties(&mut new_cam).is_ok()
            {
                let duration = start.elapsed().as_secs_f64();
                tracing::Span::current().record("connection_attempts", connection_attempts + 1);
                tracing::Span::current().record("connection_duration_secs", duration);
                tracing::Span::current().record("connection_success", true);
                return Ok(new_cam);
            }
        }

        if connection_attempts < max_connection_attempts - 1 {
            let delay = std::time::Duration::from_millis(500);
            warn!(
                "Camera {} connection attempt {} failed, retrying in {:?}...",
                camera_id,
                connection_attempts + 1,
                delay
            );
            tokio::time::sleep(delay).await;
        }
    }

    let duration = start.elapsed().as_secs_f64();
    tracing::Span::current().record("connection_attempts", max_connection_attempts);
    tracing::Span::current().record("connection_duration_secs", duration);
    tracing::Span::current().record("connection_success", false);

    Err(anyhow::anyhow!(
        "Failed to connect to camera after {} attempts",
        max_connection_attempts
    ))
}

No, I don’t use the GStreamer SDK.
I tried using a larger model. Although it used the maximum GPU resources, I still saw this error when reading 69 cameras. (I don’t see it with only 4 cameras.)
[error screenshot attached]

Hi,

Based on the error, it looks like you are hitting the decoder limitation.
Have you tried turning off the inference to see if frames still drop with 68+ cameras?

Thanks.

Is this still an issue that needs support? Are there any results you can share?