Multiple inputs to a TensorRT engine from a single input stream


I have a deep learning PyTorch model that predicts a certain activity from an input video. I have converted the model to ONNX and then to a TensorRT engine file.

To predict that activity, I have built a PyTorch pipeline around the model, which takes three inputs of shape (1,3,12,640,640), (10,3,640,640) and (10,3,640,640).

The first input is a stack of 12 consecutive frames, each of shape 3x640x640, extracted from a single video stream. The other two inputs are the first 10 and the last 10 frames of that same stack. This pre-processing is repeated iteratively to feed the three inputs to the model.

My objective is to replicate the same PyTorch pipeline in DeepStream using the TensorRT '.engine' file.

Can you please shed some light on this? What can we use from DeepStream, GStreamer or anything else to extract sets of consecutive frames from a single input video stream and process them further so they can be fed to the model in a DeepStream pipeline?

Let me know if you need any other details or clarification from my end.

Any help/suggestions would be really appreciated.

Thank you,


Just want to clarify first.

Is your first input dimension (12,3,640,640)?
Or is it (1,3,12,640,640), as you list above, which has five axes?

Also, do the last 10 images refer to the frames right before the end of the stream?
Or to the last few frames within a predefined period?


Hi @AastaLLL

Thanks for the reply.

The model expects an input of shape (1,3,12,640,640).

However, we take 12 consecutive images, each of shape 3x640x640, and preprocess them (PyTorch permute and PyTorch unsqueeze) to get (1,3,12,640,640), which is the shape the model expects.

The last 10 images are not the frames right before the end of the stream; that is only the case on the final iteration. We extract them from this same input of shape (12,3,640,640).

Basically, the last frames are taken from the first input itself. A sliding window of 12 images moves forward with a defined step, and all three inputs are extracted from that window until the stream ends.
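
To make the sliding window concrete, here is a minimal Python sketch of the idea. It is only an illustration, not DeepStream code: the function name sliding_windows and its parameters window and step are hypothetical, and the frames can be any iterable (decoded images, or plain integers for testing).

```python
from collections import deque
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def sliding_windows(frames: Iterable[T], window: int = 12, step: int = 1) -> Iterator[List[T]]:
    """Yield lists of `window` consecutive frames, advancing by `step` frames.

    Hypothetical helper: the names and defaults are assumptions for illustration.
    """
    if window < 1 or step < 1:
        raise ValueError("window and step must both be >= 1")

    # The deque keeps only the most recent `window` frames.
    buf = deque(maxlen=window)
    for count, frame in enumerate(frames, start=1):
        buf.append(frame)
        # Emit once the buffer is full, then again every `step` frames.
        if count >= window and (count - window) % step == 0:
            yield list(buf)


if __name__ == "__main__":
    # Integers stand in for frames so the window positions are easy to read.
    for w in sliding_windows(range(16), window=12, step=2):
        print(w[0], "...", w[-1])
```

Frames left over at the end of the stream that do not complete another step are simply dropped in this sketch; whether that is acceptable depends on the use case.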

Hence, from the shape (12,3,640,640) we extract the three inputs of the model: (1,3,12,640,640), (10,3,640,640) and (10,3,640,640).
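
As a rough sketch of that extraction, the following uses NumPy in place of PyTorch (transpose and expand_dims standing in for permute and unsqueeze). The function name build_inputs is hypothetical, and the axis order (1, 0, 2, 3) is an assumption inferred from the shapes above, not taken from the actual pipeline.

```python
import numpy as np


def build_inputs(window: np.ndarray):
    """Split one window of shape (12, 3, H, W) into the three model inputs.

    Hypothetical helper; the permutation order is an assumption based on
    the target shape (1, 3, 12, H, W).
    """
    window = np.asarray(window)
    if window.ndim != 4 or window.shape[0] != 12:
        raise ValueError("expected a window of shape (12, C, H, W)")

    # (12, C, H, W) -> (C, 12, H, W) -> (1, C, 12, H, W)
    clip = np.expand_dims(window.transpose(1, 0, 2, 3), axis=0)
    first10 = window[:10]   # frames 0..9
    last10 = window[-10:]   # frames 2..11

    # Contiguous copies, since inference runtimes generally expect dense buffers.
    return (
        np.ascontiguousarray(clip),
        np.ascontiguousarray(first10),
        np.ascontiguousarray(last10),
    )


if __name__ == "__main__":
    dummy = np.zeros((12, 3, 640, 640), dtype=np.float32)
    for arr in build_inputs(dummy):
        print(arr.shape)
```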

I hope I have answered your questions!

Let me know if anything else needs clarifying.


Hi @AastaLLL

Is there any update or suggestion from your end?