Direct GPU Inference


I am looking to perform inference with inputs that are already on the GPU using the C++ API. Is there a method to map the input tensor to a CUDA memory pointer? Also, is it possible for the output to remain on the GPU?


TensorRT Version: 7
CUDA Version: 10.2
CUDNN Version: 7.6
Operating System + Version: Windows 10

TensorRT never uses CPU memory as network input/output; the user must provide a GPU address for each binding.
You should be able to use CUDA device memory as bindings directly if the buffers already exist, instead of doing the host->device memcpy.


@SunilJB Thanks for the reply.

I found the following post: Questions about efficient memory management for TensorRT on TX2 - #6 by Beerend

Here the OP suggested:

void ObjectDetector::runInference() {
    util::Logger log("ObjectDetector::runInference");

    trt_context->enqueue(batch_size, &trt_input_gpu, cuda_stream, nullptr);
}

Would this be the correct implementation? How would the context know how many elements to consider as the input tensor?

Would this be the correct implementation?

It worked, but it’s not correct per se :) It’s been a while and I’ve managed to fix some issues since then (details below).

How would the context know how many elements to consider as the input tensor?

Short answer, the TensorRT engine just knows and it will try to treat whatever you jam into the function as if it were valid memory. If it’s not, you’ll be seeing some segmentation faults pass by :) So the answer is, ask the engine what it needs.

In more detail:
Looking at it again, the code you quoted worked, but only by accident, because it’s actually incorrect. The signature of the enqueue function is as follows:

virtual bool nvinfer1::IExecutionContext::enqueue(int batchSize, void** bindings, cudaStream_t stream, cudaEvent_t* inputConsumed)

As ‘bindings’ I was passing the address of a single void* where I had allocated input memory. Because of struct layout rules things happened to work out, but that was by accident, not by writing decent code.

To do it correctly, nowadays I’m keeping an std::vector<void*> where I store the pointers in their correct order as given by the trt_engine, and I pass the pointer to the first element in that vector to the enqueue function.

So in my use-case I’ll ask the de-serialized TensorRT engine about its inputs and outputs, I will allocate memory and fill in the bindings vector in the correct order. I’m posting an excerpt of my code below.

In your case you mention that you already have the inputs/outputs allocated by some other part of your program. This is not a problem (as long as it is GPU memory): you have pointers to these memory locations, just fill them in correctly in the vector or array of ‘bindings’ that the engine requires (by asking it, in a similar way as I do).

Furthermore you need to be sure that:

  1. this memory has the correct size (in bytes)
  2. the type is correct (float, uchar, …)
  3. in the case of images: the pixels are laid out in the correct order.

Good luck! I hope this explanation is helpful to you.

    void ObjectDetectorImpl::allocateBuffers() {
        // Figure out how many inputs/outputs there are.
        const int num_bindings = trt_engine->getNbBindings();

        // Allocate managed memory for the network input and output
        // and find out what the input resolution of the images is.
        for (int idx = 0; idx < num_bindings; ++idx) {
            auto dim = trt_engine->getBindingDimensions(idx);
            Expects(dim.nbDims == 3);

            DimensionCHW dim_chw{dim.d[0], dim.d[1], dim.d[2]};

            if (trt_engine->bindingIsInput(idx)) {
                input_dimension = dim_chw;
                void* trt_input_ptr = nullptr;
                NV_CUDA_CHECK(cudaHostAlloc(&trt_input_ptr, dim_chw.size() * sizeof(float), cudaHostAllocMapped));
                // Keep track of these void* in the order in which they appear as 'binding' in the engine.
                // I keep a std::vector<void*> bindings that I'll pass to the engine's enqueue function.
            } else {
                // Do something similar for outputs.
                // Keep track of which is which, because you'll be writing to one and reading from another.
            }
        }
    }
@Beerend Thanks for your detailed reply. I have been experimenting with the interface with the simple mnist onnx example to better understand the basics of the inference engine. This example runs a synchronous inference, which is slightly different from your application.

The original program works as follows. An ONNX model was imported from a file, and an engine was built from it:

   mEngine = std::shared_ptr<nvinfer1::ICudaEngine>(
    builder->buildEngineWithConfig(*network, *config), samplesCommon::InferDeleter());

From this a buffer manager object and execution context is created:

// Create RAII buffer manager object
samplesCommon::BufferManager buffers(mEngine, mParams.batchSize);

auto context = SampleUniquePtr<nvinfer1::IExecutionContext>(mEngine->createExecutionContext());

From here an input is read from a text file and put onto the host buffer (on CPU side to my understanding):

readPGMFile(locateFile(std::to_string(mNumber) + ".pgm", mParams.dataDirs), fileData.data(), inputH, inputW);

float* hostDataBuffer = static_cast<float*>(buffers.getHostBuffer(mParams.inputTensorNames[0]));
for (int i = 0; i < inputH * inputW; i++)
    hostDataBuffer[i] = 1.0 - float(fileData[i] / 255.0);

Then the input data is copied to the gpu, inference is run, and the output data is copied from gpu to cpu:


buffers.copyInputToDevice();

bool status = context->executeV2(buffers.getDeviceBindings().data());
if (!status)
    return false;

buffers.copyOutputToHost();


In the above, “status” returns as true. So with this as a baseline, I have been doing the following experimentation.

I already have loaded the same data into the GPU with pointer name “gpuTensorInput” and have set a desired output with pointer name “tensorOutput”.

I have first looked at the buffer object:

“buffers” is a BufferManager object whose mDeviceBindings member holds two pointers:

  • 0x0000000c04800e00 : location of input
  • 0x0000000c04801c00 : location of output

As a test, I ran the following lines

auto a = buffers.getDeviceBindings().data();
auto b = buffers.getDeviceBindings();

I saw the following:

a is a void** that holds the address of the tensor input 0x0000000c04800e00

b is a vector of pointers that holds the addresses of the tensor input (0x0000000c04800e00) and output (0x0000000c04801c00).

From here, I tried implementing various approaches. First, I used this link as a guide: Developer Guide :: NVIDIA Deep Learning TensorRT Documentation

int inputIndex = mEngine->getBindingIndex("Input3");
int outputIndex = mEngine->getBindingIndex("Plus214_Output_0");

void* buf[2];
buf[0] = &gpuTensorInput;
buf[1] = &tensorOutput;
bool status = context->executeV2(buf);

This did not work, and “status” returned false.

I also tried experimenting with the following:

auto c = buffers.getDeviceBuffer(mParams.inputTensorNames[0]);
c = &gpuTensorInput;
bool status = context->executeV2(&c);

This also did not work and returned false for “status”.

I also tried:

buffers.getDeviceBindings()[0] = &gpuTensorInput;
buffers.getDeviceBindings()[1] = &tensorOutput;
bool status = context->executeV2(buffers.getDeviceBindings().data());

But I got the error: TRT] C:\source\rtExt\engine.cpp (902) - Cuda Error in nvinfer1::rt::ExecutionContext::executeInternal: 700 (an illegal memory access was encountered)

I believe I basically need to manipulate the pointer values in “buffers”, but I am unsure how to do so.

int inputIndex = mEngine->getBindingIndex("Input3");
int outputIndex = mEngine->getBindingIndex("Plus214_Output_0");
void* buf[2];
buf[0] = &gpuTensorInput;
buf[1] = &tensorOutput;
bool status = context->executeV2(buf);

This looks reasonable, but I’d make the following changes (compile in debug mode so the assertions are checked):

int inputIndex = mEngine->getBindingIndex("Input3");
int outputIndex = mEngine->getBindingIndex("Plus214_Output_0");

// Double check that your indices are 0 or 1 (they can be -1 if the name you used is not a binding)
assert(inputIndex == 0 || inputIndex == 1);
assert(outputIndex == 0 || outputIndex == 1);

// Double check that the layers you named are really input and output
assert(mEngine->bindingIsInput(inputIndex) == true);
assert(mEngine->bindingIsInput(outputIndex) == false);

void* buf[2];
// Use the index you queried (don't assume 0 and 1)
buf[inputIndex] = gpuTensorInput; // Assuming gpuTensorInput is a void* (no need to take the address of it.)
buf[outputIndex] = tensorOutput; // Assuming tensorOutput is a void* (no need to take the address of it.)
bool status = context->executeV2(buf);

@Beerend Thank you very much for your assistance :)