I’m working on a project and trying to understand the best approach for running inference on three or more camera streams, or on a single high-resolution camera (4K or 8K). I’m using the D-FINE model, which I converted to ONNX with the provided export_onnx.py script and then built into a TensorRT engine with trtexec.
Multiple cameras
From what I’ve read, batching is the preferred way to optimize inference across multiple camera streams. While reading about batching, I also came across Dynamic Shapes, and I’m not sure whether Dynamic Shapes and batching are the same thing. If not, what’s the difference?
I also have a question about how buffers should be provided to the engine. From my understanding, even when using 3, 4, or more cameras, I should still feed a single pointer/buffer into the engine, where that buffer contains the data for all camera frames.
For example, if I have 3 cameras, should I place the 3 frames into one contiguous buffer and then run inference using that single buffer? Or is there also an API that allows me to feed multiple independent sources into the engine at once?
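To make the question concrete, here is a rough sketch of what I mean by "one contiguous buffer". It assumes the engine expects NCHW float32 input of 3 × 640 × 640 per frame, which I have not confirmed for my export, and the helper names `pack_batch` and `frame_offset_bytes` are my own, not part of any TensorRT API. It uses plain Python arrays as a stand-in for the real host/device buffers.

```python
from array import array

# Assumed input layout: NCHW float32, 3 x 640 x 640 per frame
# (an assumption about the exported model, not something I have verified).
CHANNELS, HEIGHT, WIDTH = 3, 640, 640
FRAME_ELEMS = CHANNELS * HEIGHT * WIDTH
BYTES_PER_FLOAT32 = 4


def pack_batch(frames):
    """Copy per-camera frames back to back into one contiguous float32 buffer."""
    batch = array("f")
    for frame in frames:
        if len(frame) != FRAME_ELEMS:
            raise ValueError("every frame must already be preprocessed to 3x640x640")
        batch.extend(frame)
    return batch


def frame_offset_bytes(index):
    """Byte offset of camera `index` inside the packed buffer."""
    return index * FRAME_ELEMS * BYTES_PER_FLOAT32


# Three dummy frames standing in for three preprocessed camera images;
# camera i is filled with the value i so the layout is easy to check.
frames = [array("f", [float(i)]) * FRAME_ELEMS for i in range(3)]
batch = pack_batch(frames)
```

With this layout, camera 0 occupies the first `FRAME_ELEMS` floats, camera 1 the next `FRAME_ELEMS`, and so on, so the whole batch would be handed to the engine through a single pointer. Is that the intended usage?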
One large-resolution stream
This is somewhat similar to the multiple-camera question above, but in this case I receive a single video frame with a resolution of, for example, 3840 × 2160.
The object detection model I’m using (D-FINE) only accepts 640 × 640 images, so I was thinking about slicing the large frame into multiple 640 × 640 tiles and feeding those into the engine. I want to slice rather than scale because the camera is ceiling-mounted and far from the floor, so scaling the full frame down would likely make objects too small to detect reliably.
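The slicing I have in mind looks roughly like the sketch below. It only computes the top-left corner of each tile; `tile_origins` and its `overlap` parameter are hypothetical names of mine, and shifting the last row/column back inside the frame is just one possible way to handle a frame size that is not a multiple of 640.

```python
def tile_origins(frame_w, frame_h, tile=640, overlap=0):
    """Return (x, y) top-left corners of tile x tile crops covering the frame.

    The last row/column is shifted back so it stays inside the frame,
    which means edge tiles may overlap their neighbours more than `overlap`.
    """
    if tile > frame_w or tile > frame_h:
        raise ValueError("tile is larger than the frame")
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be in [0, tile)")
    stride = tile - overlap

    def axis(length):
        # Regular grid positions along one axis.
        starts = list(range(0, length - tile + 1, stride))
        # Add a final position flush with the far edge if the grid missed it.
        if starts[-1] != length - tile:
            starts.append(length - tile)
        return starts

    return [(x, y) for y in axis(frame_h) for x in axis(frame_w)]


origins = tile_origins(3840, 2160)
```

For a 3840 × 2160 frame with no overlap this gives 6 columns and 4 rows, so 24 tiles per frame, with the bottom row overlapping the row above it because 2160 is not a multiple of 640. Detections from each tile would then need to be shifted by the tile's (x, y) origin to get full-frame coordinates.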
What I’m unsure about is how these slices should be provided to the engine in the most optimized way. Is batching the correct approach here as well?
Fixed input size
If I know in advance exactly how many camera streams will be fed into the engine (so the batch size is fixed), are there additional optimizations I could apply?
Thanks!