Using inputs to the model that are already on the device

Description

In all the examples I have seen about running inference with TensorRT, it is assumed that the input of our model is on the host and that we want to return the output of the model to the host.

Are there any examples where the input of our model is already on the device and where the output of our model stays on the device so we can keep working with it there?

I was wondering whether this would improve the speed of our model compared to moving the inputs and outputs between the host and the device all the time.

Environment

TensorRT Version: 10.8
GPU Type: RTX A4500
Nvidia Driver Version: 12.4
CUDA Version: 12.4
CUDNN Version:
Operating System + Version:
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Hi @alvarocbc8, can you share a little more about how you’re using/executing TensorRT?

I found a couple of links that might be helpful –
If you’re using Triton Inference Server, there’s a shared memory extension that allows you to pass pointers to a memory region on either the host or device to your executing model.
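
As a rough sketch only (this loosely follows the pattern of the tritonclient CUDA shared-memory examples; the server address, model name "my_model", tensor name "INPUT0", shapes, and region name are all placeholders), the flow looks something like this:

```python
import numpy as np
import tritonclient.grpc as grpcclient
import tritonclient.utils.cuda_shared_memory as cudashm
from tritonclient.utils import np_to_triton_dtype

client = grpcclient.InferenceServerClient("localhost:8001")

input_data = np.ones((1, 16), dtype=np.float32)  # placeholder data
byte_size = input_data.nbytes

# Create a CUDA shared-memory region on GPU 0, fill it, and register it
# with the server.
shm_handle = cudashm.create_shared_memory_region("input_region", byte_size, 0)
cudashm.set_shared_memory_region(shm_handle, [input_data])
client.register_cuda_shared_memory(
    "input_region", cudashm.get_raw_handle(shm_handle), 0, byte_size)

# Point the request's input at the registered region instead of sending the
# data over the wire.
infer_input = grpcclient.InferInput("INPUT0", list(input_data.shape),
                                    np_to_triton_dtype(input_data.dtype))
infer_input.set_shared_memory("input_region", byte_size)
result = client.infer("my_model", inputs=[infer_input])
```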

I believe the same is true if you’re executing the model directly: you can pass pointers to input and output buffers that live on the device. See C++ API Documentation — NVIDIA TensorRT Documentation

This can definitely improve model speed, especially with large inputs or large batches.
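
If it helps, here’s a rough sketch of what that can look like with the TensorRT Python API (assuming a serialized engine file "model.engine", a single float32 input and output, and PyTorch tensors as the device buffers; adjust the shapes and names to your model):

```python
import tensorrt as trt
import torch

# Deserialize the engine (the path is a placeholder).
logger = trt.Logger(trt.Logger.WARNING)
with open("model.engine", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Look up the I/O tensor names (assumes one input and one output).
names = [engine.get_tensor_name(i) for i in range(engine.num_io_tensors)]
in_name = next(n for n in names if engine.get_tensor_mode(n) == trt.TensorIOMode.INPUT)
out_name = next(n for n in names if engine.get_tensor_mode(n) == trt.TensorIOMode.OUTPUT)

# Input that is already on the GPU (placeholder; in practice it comes from
# earlier GPU work), plus a GPU output buffer. float32 is assumed here.
input_gpu = torch.randn(1, 3, 224, 224, device="cuda")
context.set_input_shape(in_name, tuple(input_gpu.shape))
output_gpu = torch.empty(tuple(context.get_tensor_shape(out_name)),
                         dtype=torch.float32, device="cuda")

# Bind the device pointers directly, so there are no host/device copies.
context.set_tensor_address(in_name, input_gpu.data_ptr())
context.set_tensor_address(out_name, output_gpu.data_ptr())

# Run on a CUDA stream; output_gpu stays on the GPU for further work.
stream = torch.cuda.Stream()
context.execute_async_v3(stream.cuda_stream)
stream.synchronize()
```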

Hi Neal, thanks for your quick response.

I’m using TensorRT to run a custom model in Python.

I already have my model input on my GPU, so I just want to run my model and continue working with the model output on my GPU.

I’ve seen this example link, but it assumes the input is on the CPU, which isn’t the case for me.

I’ve worked around this by using “cuda.memcpy_dtod_async” instead of “cuda.memcpy_htod_async” and “cuda.memcpy_dtoh_async,” targeting my input’s device pointer directly, but the performance is worse than compiling my model with PyTorch’s “torch_tensorrt” library.
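
Roughly, the workaround looks like this (simplified: “context” is my TensorRT execution context, “input_gpu_ptr” is the device pointer of the tensor that is already on the GPU, and the tensor names and sizes are placeholders for my real model):

```python
import pycuda.autoinit  # creates a CUDA context
import pycuda.driver as cuda

# Buffers allocated for the engine bindings (sizes are placeholders).
input_nbytes = 1 * 3 * 224 * 224 * 4
output_nbytes = 1 * 1000 * 4
d_input = cuda.mem_alloc(input_nbytes)
d_output = cuda.mem_alloc(output_nbytes)

stream = cuda.Stream()

# Device-to-device copy of the input instead of memcpy_htod_async from the host.
cuda.memcpy_dtod_async(d_input, input_gpu_ptr, input_nbytes, stream)

context.set_tensor_address("input", int(d_input))
context.set_tensor_address("output", int(d_output))
context.execute_async_v3(stream.handle)
stream.synchronize()
# d_output stays on the device, so I keep working with it there.
```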

Also, when I use this approach for inference, I notice that other parts of my code slow down. Do you think this might be because we’re manually writing to memory?