Advice regarding batch size, multiple cameras and large image buffers

I’m working on a project and trying to understand the best approach for running inference on 3 (or more) camera streams, or from a single high-resolution camera (4K or 8K). I’m using the D-FINE model, which I converted to ONNX using the provided export_onnx.py script. I then converted the ONNX model into a TensorRT engine using trtexec.

Multiple cameras

From what I read, batching is the preferred way to optimize inference when using multiple camera streams. However, while reading about batching, I also came across Dynamic Shapes, and I’m not entirely sure whether Dynamic Shapes are the same thing as batching. If not, what’s the difference?

I also have a question about how buffers should be provided to the engine. From my understanding, even when using 3, 4, or more cameras, I should still feed a single pointer/buffer into the engine, where that buffer contains the data for all camera frames.

For example, if I have 3 cameras, should I place the 3 frames into one contiguous buffer and then run inference using that single buffer? Or is there also an API that allows me to feed multiple independent sources into the engine at once?

One large-resolution stream

This is somewhat similar to the multiple-camera question above, but in this case I receive a single video frame with a resolution of, for example, 3840 × 2160.

The object detection model I’m using (D-FINE) only accepts 640 × 640 images, so I was thinking about slicing the large frame into multiple 640 × 640 tiles and feeding those into the engine. The reason I want to slice instead of scaling is because the camera is ceiling-mounted and positioned far away from the floor, so scaling the full frame down would likely make objects too small to detect reliably.

What I’m unsure about is how these slices should be provided to the engine in the most optimized way. Is batching the correct approach here as well?

Fixed input size

Are there additional optimizations I could apply when I know exactly how many camera streams will be fed into the engine?

Thanks!

*** Please note that this reply is generated by LLM automatically ***

I’ll address each of your questions in detail.

Batching vs. Dynamic Shapes

Batching and Dynamic Shapes are related but distinct concepts in TensorRT.

Batching refers to the process of grouping multiple inputs together to form a single input tensor, which is then processed by the network. This is useful when you have multiple camera streams or frames that need to be processed simultaneously. By batching these inputs, you can reduce the overhead of launching multiple inference requests and improve overall performance.

Dynamic Shapes, on the other hand, refer to the ability of a TensorRT engine to accept input tensors with varying shapes, including batch sizes. This means that a single engine can be used to process inputs with different batch sizes, without the need to recompile the engine.

In your case, if you’re working with multiple camera streams, batching is likely the preferred approach. However, if you need to process inputs with varying batch sizes or shapes, Dynamic Shapes might be a better fit.

Providing Buffers to the Engine

When using multiple camera streams, you should feed a single pointer buffer into the engine, where that buffer contains the data for all camera frames. This is because TensorRT engines expect a single input tensor, which can be a batch of multiple inputs.

For example, if you have 3 cameras, you should place the 3 frames into one contiguous buffer and then run inference using that single buffer. You can use the cudaMalloc function to allocate a buffer that’s large enough to hold all the frames, and then copy the frames into that buffer using cudaMemcpy.

There isn’t a specific API that allows you to feed multiple independent sources into the engine at once. However, you can use the cudaStream API to launch multiple inference requests concurrently, which can help improve performance.

Processing Large-Resolution Streams

When processing a single large-resolution stream, slicing the frame into multiple smaller tiles is a good approach. This is because the object detection model you’re using (D-FINE) only accepts 640x640 images, and scaling the full frame down might make objects too small to detect reliably.

To provide these slices to the engine, you can use batching again. Create a buffer that’s large enough to hold all the slices, and then copy the slices into that buffer using cudaMemcpy. You can then run inference on that buffer using the TensorRT engine.

Fixed Input Size

If you know exactly how many camera streams will be fed into the engine, you can optimize the engine for that specific batch size. This can help improve performance by reducing the overhead of dynamic memory allocation and deallocation.

To optimize the engine for a fixed input size, you can use the nvinfer1::Builder API to set the batch size explicitly. For example:

nvinfer1::Builder builder;
builder.setBatchSize(3); // Set the batch size to 3

This will optimize the engine for a batch size of 3, which can help improve performance.

Additional Optimizations

Here are some additional optimizations you can apply:

  1. Use FP16 precision: If your GPU supports FP16 precision, you can use it to reduce memory bandwidth and improve performance.
  2. Use TensorRT’s built-in optimizations: TensorRT provides several built-in optimizations, such as kernel fusion and constant folding, which can help improve performance.
  3. Use a larger batch size: If you have a large number of camera streams, using a larger batch size can help improve performance by reducing the overhead of launching multiple inference requests.
  4. Use multiple GPUs: If you have multiple GPUs available, you can use them to process multiple camera streams concurrently, which can help improve performance.

I hope this helps! Let me know if you have any further questions.


Great answer, but it would be good if someone could verify that the generated content above is correct. It seems very logical, but it’s generated by an LLM.

After reading this answer it’s clear that I should provide one large buffer when I have, for example, multiple cameras. I was hoping that I could provide a pointer to the data for each batch item, as that avoids copying memory around.

That matters especially because I’ve been thinking about using cameras that support GPUDirect. With 3 cameras using GPUDirect I don’t have to transfer memory from CPU to GPU, but because of how batching works I still need to combine the frames into one large buffer, which feels like bad design. GPU->GPU memory transfers are pretty fast, but not doing a memory transfer at all is always faster, of course.

Still looking forward to getting advice from someone who has experience with this.

Hi @diederick, you were right to push back on the auto-triage reply, and your GPUDirect intuition is on the right track. Let me unpack each part properly, and flag a couple of things the earlier post got wrong.

1. Batching vs dynamic shapes (they’re not the same thing)

  • Batching is what you feed the engine: a single tensor whose first dim is N independent inputs (your N cameras’ frames, or N tiles).

  • Dynamic shapes is how the engine was built: an engine built with optimization profiles can accept inputs whose shape varies at runtime (including the batch dim, but also H, W, sequence length, etc.).

You almost certainly want explicit batching plus dynamic shapes, where the dynamic dim is the batch. Build the engine with one optimization profile that covers your expected batch range, e.g.:


trtexec --onnx=dfine.onnx --saveEngine=dfine.engine --fp16 \
    --minShapes=images:1x3x640x640 \
    --optShapes=images:4x3x640x640 \
    --maxShapes=images:8x3x640x640

(Pick optShapes close to the batch you’ll see most often, since TRT picks the tactics for that point.)
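To make the profile range concrete, here is a minimal sketch of how the input allocation could be sized from it. The helper names (input_bytes, BatchProfile, batch_bytes) are hypothetical and not part of TensorRT. The element size is left as a parameter on purpose: as far as I know, trtexec keeps FP32 input/output tensors by default even with --fp16 unless the I/O formats are set explicitly, so check what your engine actually expects.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: bytes needed for an input of shape [batch, c, h, w].
// elem_size is the byte size of one input element (2 for FP16, 4 for FP32).
constexpr std::size_t input_bytes(std::size_t batch, std::size_t c,
                                  std::size_t h, std::size_t w,
                                  std::size_t elem_size) {
    return batch * c * h * w * elem_size;
}

// Hypothetical mirror of the min/opt/max batch values given to trtexec.
struct BatchProfile {
    std::size_t min_batch;
    std::size_t opt_batch;
    std::size_t max_batch;
};

// Bytes actually used by one inference call at a given runtime batch size.
// The assert documents that the batch must stay inside the profile range.
inline std::size_t batch_bytes(const BatchProfile& p, std::size_t batch,
                               std::size_t elem_size) {
    assert(batch >= p.min_batch && batch <= p.max_batch);
    return input_bytes(batch, 3, 640, 640, elem_size);
}

// Allocate once for the profile maximum so every batch in range fits.
constexpr std::size_t kMaxBatch = 8;  // matches --maxShapes above
constexpr std::size_t kAllocFp32 = input_bytes(kMaxBatch, 3, 640, 640, 4);
static_assert(kAllocFp32 == 39321600, "8 x 3 x 640 x 640 x 4 bytes");
```

Sizing the device buffer for the maximum batch means a frame with fewer active cameras or tiles only uses a prefix of the same allocation, so nothing is reallocated at runtime.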

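A small correction on the earlier reply: builder.setBatchSize(...) is not part of the current API. It resembles the old implicit-batch interface (IBuilder::setMaxBatchSize), which was deprecated during the TRT 8.x releases and removed in TRT 10. Don’t use that surface. Modern TRT builds the network with IBuilder::createNetworkV2 in explicit-batch mode: on TRT 8.x you pass the kEXPLICIT_BATCH flag, while on TRT 10 explicit batch is the only mode and the flag is deprecated and ignored. So the batch dim is just the first dim of your input tensor, like any other.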

2. Multi-camera buffer layout, GPUDirect-friendly

You do feed a single device pointer into the engine for the input binding, but that does NOT mean you have to copy the per-camera frames into a contiguous buffer at runtime. You allocate the batched buffer once at startup and have each camera DMA directly into its slot inside it.

Concretely, for 3 cameras at 3x640x640 FP16:

  1. Allocate one cudaMalloc-backed device buffer of shape [3, 3, 640, 640]. Call its base pointer d_input.

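  2. Per camera i, the slot pointer is the base plus a byte offset: (uint8_t*)d_input + i * (3*640*640*sizeof(half)). Do the arithmetic on a byte pointer; on a half* the same expression would overshoot by a factor of sizeof(half).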

  3. Configure each GPUDirect-RDMA camera to land its frame straight into its slot pointer. No CPU bounce, no GPU-to-GPU copy.

  4. Call enqueueV3 once with d_input and your output buffer.

That’s the zero-copy pattern: the engine still sees one input pointer (which is what TRT requires), but you never copied a byte after the camera wrote it. The “feed each camera independently” API doesn’t exist because TRT’s perf model is built around one fused launch over the whole batch, but the buffer-side trick above gives you the same physical behavior.

If the per-camera arrival timing is asynchronous, double-buffer the device input and use a CUDA event per camera to know when its slot is filled before the next enqueueV3.
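The slot arithmetic from the steps above can be sketched as follows. This is a host-only illustration: slot_bytes and slot_ptr are hypothetical helpers, and in a real pipeline the base pointer would come from cudaMalloc rather than a std::vector. It also assumes each camera (or a preprocessing kernel in front of it) writes data that is already in the engine’s input layout and data type; whether a given camera SDK can target a sub-range of a CUDA allocation is something to confirm with the vendor.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Byte size of one camera slot: channels * height * width * bytes per element.
constexpr std::size_t slot_bytes(std::size_t c, std::size_t h, std::size_t w,
                                 std::size_t elem_size) {
    return c * h * w * elem_size;
}

// Address of one camera's slot inside the batched buffer.
// The offset is in bytes, so the arithmetic is done on a byte pointer.
inline std::uint8_t* slot_ptr(void* base, std::size_t camera,
                              std::size_t num_cameras,
                              std::size_t bytes_per_slot) {
    assert(camera < num_cameras);  // guard against writing past the batch
    return static_cast<std::uint8_t*>(base) + camera * bytes_per_slot;
}

// Stand-in for the single batched allocation ([3, 3, 640, 640], 2-byte
// elements). With CUDA this would be one cudaMalloc of the same size.
inline std::vector<std::uint8_t> make_batched_buffer(std::size_t num_cameras) {
    return std::vector<std::uint8_t>(num_cameras * slot_bytes(3, 640, 640, 2));
}
```

Each camera is configured once with its own slot_ptr result, and the engine is always given the base pointer, so the batch is assembled in place rather than by a gather copy.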

3. 4K/8K tiling for the single high-res camera

Tiling into 640x640 patches and batching is the right call for D-FINE since you don’t want to lose small-object pixels by downscaling. Same buffer pattern as #2: allocate one big [N_tiles, 3, 640, 640] device buffer once, and run a small CUDA kernel (or cudaMemcpy2D with stride tricks) to slice the 4K frame into the tile slots. Both options are pure GPU-side, so no PCIe cost.
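As a rough sketch of the slicing step, the code below computes tile origins and copies one tile row by row on the host. It assumes a planar [C, H, W] float frame, which is an assumption about your preprocessing, and tile_origins and copy_tile are hypothetical names. On the GPU, the same strided row copy is what cudaMemcpy2D or a small kernel would perform. Clamping the last tile to the frame edge is my choice here; padding the frame is an alternative.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Tile start coordinates along one axis. The last tile is clamped so it ends
// flush with the frame edge, which makes it overlap its neighbour when the
// frame size is not a multiple of the stride.
inline std::vector<int> tile_origins(int frame_dim, int tile, int overlap) {
    assert(tile > 0 && overlap >= 0 && overlap < tile && frame_dim >= tile);
    const int stride = tile - overlap;
    std::vector<int> origins;
    for (int o = 0; o + tile < frame_dim; o += stride) origins.push_back(o);
    origins.push_back(frame_dim - tile);
    return origins;
}

// Copy one tile from a planar [C, H, W] frame into a [C, tile, tile] slot.
inline void copy_tile(const float* frame, int frame_h, int frame_w,
                      int channels, int x0, int y0, int tile, float* slot) {
    for (int c = 0; c < channels; ++c) {
        for (int y = 0; y < tile; ++y) {
            const float* src = frame +
                (static_cast<std::size_t>(c) * frame_h + (y0 + y)) * frame_w + x0;
            float* dst = slot + (static_cast<std::size_t>(c) * tile + y) * tile;
            std::copy(src, src + tile, dst);  // one contiguous row of the tile
        }
    }
}
```

For a 3840 × 2160 frame with 640-pixel tiles and no overlap, this gives 6 origins horizontally and 4 vertically (the last row starts at y = 1520), so N_tiles is 24.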

If you find you’re tiling beyond what one engine call can hold, two practical tricks:

  • Allow tile overlap (e.g. 64-pixel border) so detections that straddle the tile boundary still resolve. NMS the union afterwards.

  • If tile count varies frame-to-frame, build with a wider dynamic-batch profile so you don’t need separate engines.
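The overlap-then-merge idea in the first bullet could look roughly like this. The Box struct and the assumption that the model’s boxes arrive as (x1, y1, x2, y2) in tile pixel coordinates are mine, so adapt them to the actual D-FINE postprocessing output. The merge is a plain greedy, class-wise NMS over the union of all tiles; it does not try to stitch partial boxes of an object that was cut by a tile edge.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical detection record: corner coordinates, score, class label.
struct Box {
    float x1, y1, x2, y2, score;
    int label;
};

// Shift a tile-local box into full-frame coordinates using the tile origin.
inline Box to_frame(Box b, int x0, int y0) {
    b.x1 += x0; b.x2 += x0;
    b.y1 += y0; b.y2 += y0;
    return b;
}

// Intersection over union of two axis-aligned boxes.
inline float iou(const Box& a, const Box& b) {
    const float iw = std::max(0.0f, std::min(a.x2, b.x2) - std::max(a.x1, b.x1));
    const float ih = std::max(0.0f, std::min(a.y2, b.y2) - std::max(a.y1, b.y1));
    const float inter = iw * ih;
    const float uni = (a.x2 - a.x1) * (a.y2 - a.y1) +
                      (b.x2 - b.x1) * (b.y2 - b.y1) - inter;
    return uni > 0.0f ? inter / uni : 0.0f;
}

// Greedy NMS: keep the highest-scoring box, drop same-class boxes that
// overlap it by more than iou_thresh, then repeat with the next survivor.
inline std::vector<Box> nms(std::vector<Box> boxes, float iou_thresh) {
    std::sort(boxes.begin(), boxes.end(),
              [](const Box& a, const Box& b) { return a.score > b.score; });
    std::vector<Box> kept;
    for (const Box& cand : boxes) {
        bool suppressed = false;
        for (const Box& k : kept) {
            if (k.label == cand.label && iou(k, cand) > iou_thresh) {
                suppressed = true;
                break;
            }
        }
        if (!suppressed) kept.push_back(cand);
    }
    return kept;
}
```

In use, every tile’s detections are passed through to_frame with that tile’s origin, appended to one vector, and the vector is then given to nms once per frame.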

4. Fixed input size optimizations

If you do know the camera count up front and never deviate, drop dynamic shapes entirely and build a static engine with min == opt == max == [N, 3, 640, 640]. TRT can pick more aggressive tactics when shapes are fully fixed (no shape-dispatch overhead, sometimes better fusion choices). It’s usually a small but real win on detection workloads. You can validate the delta by comparing --profilingVerbosity=detailed --dumpProfile between the static and dynamic builds.

One follow-up: which TRT version are you on, and which D-FINE export are you using (the official one from the D-FINE repo, or a custom one)? That’ll determine whether there are any version-specific gotchas worth flagging on the optimization profile or the postprocessing graph.

Thanks, Atharva