TensorRT V10 inference using context.execute_async_v3()

There are many examples of inference using context.execute_async_v2().
However, v2 has been deprecated and there are no examples anywhere using context.execute_async_v3(…).

The TensorRT developer page says to specify buffers for inputs and outputs with context.set_tensor_address(name, ptr).

The API has context.set_input_shape("name", tuple(input_batch.shape)) and set_output_allocator(), but after days of mucking around I have got nowhere.

Can someone please provide an example or suggestion?

Thanks

First, you have to set input shape:

tensor_name = engine.get_tensor_name(0) # input tensor
context.set_input_shape(tensor_name, input_shape) # use your input_shape
assert context.all_binding_shapes_specified
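
If your engine has more than one I/O tensor, you don't have to hardcode index 0. A small sketch (assuming engine is already deserialized) that lists every tensor and whether it is an input or an output:

for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    mode = engine.get_tensor_mode(name)  # trt.TensorIOMode.INPUT or trt.TensorIOMode.OUTPUT
    print(name, mode, engine.get_tensor_shape(name))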

Then set up input and output buffers (I use numpy arrays as input and output):

d_input = cuda.mem_alloc(int(np.prod(input_shape)) * np.dtype(np.float32).itemsize)
d_output = cuda.mem_alloc(int(np.prod(output_shape)) * np.dtype(np.float32).itemsize)
context.set_tensor_address(engine.get_tensor_name(0), int(d_input))  # input buffer
context.set_tensor_address(engine.get_tensor_name(1), int(d_output))  # output buffer

Then you can run inference:

cuda.memcpy_htod_async(d_input, input_data, stream) # put data to input
context.execute_async_v3(stream_handle=stream.handle)
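
Putting the pieces together, here is a rough end-to-end sketch. The engine file name model.engine, the float32 dtype, the input shape, and the single-input/single-output layout are assumptions; replace them with your model's values:

import numpy as np
import pycuda.autoinit  # creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("model.engine", "rb") as f:  # assumed engine file name
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
stream = cuda.Stream()

input_name = engine.get_tensor_name(0)   # assumes tensor 0 is the input
output_name = engine.get_tensor_name(1)  # assumes tensor 1 is the output

input_shape = (1, 3, 224, 224)  # assumed; use your model's input shape
context.set_input_shape(input_name, input_shape)
output_shape = tuple(context.get_tensor_shape(output_name))

# Device buffers, bound to the tensors once
d_input = cuda.mem_alloc(int(np.prod(input_shape)) * np.dtype(np.float32).itemsize)
d_output = cuda.mem_alloc(int(np.prod(output_shape)) * np.dtype(np.float32).itemsize)
context.set_tensor_address(input_name, int(d_input))
context.set_tensor_address(output_name, int(d_output))

# Per batch: host -> device, run, device -> host, wait for the stream
input_data = np.random.rand(*input_shape).astype(np.float32)
output_data = np.empty(output_shape, dtype=np.float32)
cuda.memcpy_htod_async(d_input, input_data, stream)
context.execute_async_v3(stream_handle=stream.handle)
cuda.memcpy_dtoh_async(output_data, d_output, stream)
stream.synchronize()  # make sure output_data is ready before using it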

Hi, how do you copy the results back?

n = output_name
output_shape = self.engine.get_tensor_shape(n)
output_shape = (input_bs, *output_shape[1:])
d_output = int(cuda.mem_alloc(np.random.randn(*output_shape).astype(self.dtype).nbytes))
self.context.set_tensor_address(n, d_output)
output = np.empty(output_shape, dtype=self.dtype)
cuda.memcpy_dtoh_async(output, self.context.get_tensor_address(n), self.stream)

This is what I did, but I keep getting pycuda._driver.LogicError: cuMemcpyDtoHAsync failed: an illegal memory access was encountered at the dtoh line.

output_data = np.empty(output_shape, dtype=np.float32)  # create numpy array to hold output data
cuda.memcpy_dtoh_async(output_data, d_output, stream)  # copy output data from the output buffer into the numpy array
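
Note that the copy is asynchronous, so before reading output_data on the host, wait for the stream (a one-line sketch, assuming the same stream object as above):

stream.synchronize()  # wait until the device-to-host copy has actually finished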

As I understand it,

  1. You reserve memory space by setting up input and output buffers (d_input, d_output).

  2. Then connect those buffers to the model's input and output tensors (context.set_tensor_address).

Steps 1 and 2 you do just once.

  3. After that you put input data into the input buffer (cuda.memcpy_htod_async).

  4. Run inference (context.execute_async_v3).

  5. Then you get output data from the output buffer (cuda.memcpy_dtoh_async).

Steps 3-5 you repeat for every batch during inference; a sketch of that loop follows below.
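
A minimal sketch of that per-batch loop (steps 3-5), assuming d_input, d_output, output_data and the tensor addresses from steps 1-2 are already set up, and that batches is your iterable of float32 numpy arrays:

for input_data in batches:
    cuda.memcpy_htod_async(d_input, input_data, stream)    # step 3: host -> input buffer
    context.execute_async_v3(stream_handle=stream.handle)  # step 4: run inference
    cuda.memcpy_dtoh_async(output_data, d_output, stream)  # step 5: output buffer -> host
    stream.synchronize()                                    # wait before using output_data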

cuda.memcpy_dtoh_async(output, self.context.get_tensor_address(n), self.stream)

In your example you’re trying to get the output result from the output tensor instead of getting it from the output buffer, as you have to.

I see, so I should be using

cuda.memcpy_dtoh_async(output, d_output, self.stream)

instead.

I previously thought that once I had set the tensor address to d_output, I could reuse that address. So they are different.