There are many examples of inference using context.execute_async_v2().
However, v2 has been deprecated and there are no examples anywhere using context.execute_async_v3(…).
The TensorRT developer page says to specify buffers for inputs and outputs with “context.set_tensor_address(name, ptr)”.
The API has “context.set_input_shape(name, tuple(input_batch.shape))” and “set_output_allocator()”, but after days of mucking around I have gotten nowhere.
Can someone please provide an example or suggestion?
Thanks
First, you have to set the input shape:
tensor_name = engine.get_tensor_name(0) # input tensor
context.set_input_shape(tensor_name, input_shape) # use your input_shape
assert context.all_binding_shapes_specified
Then set up input and output buffers (I use numpy arrays as input and output):
import numpy as np
import pycuda.driver as cuda  # assumes a CUDA context already exists, e.g. via pycuda.autoinit

d_input = cuda.mem_alloc(int(np.prod(input_shape)) * np.dtype(np.float32).itemsize)
d_output = cuda.mem_alloc(int(np.prod(output_shape)) * np.dtype(np.float32).itemsize)
context.set_tensor_address(engine.get_tensor_name(0), int(d_input))   # input buffer
context.set_tensor_address(engine.get_tensor_name(1), int(d_output))  # output buffer
Then you can run inference:
cuda.memcpy_htod_async(d_input, input_data, stream)  # copy input data to the device
context.execute_async_v3(stream_handle=stream.handle)
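Putting these steps together, here is a minimal end-to-end sketch under a few assumptions: PyCUDA for the device buffers, a pre-built engine file ("model.engine" and the (1, 3, 224, 224) shape are placeholders), and a single float32 input at tensor index 0 with a single float32 output at index 1:

# Minimal sketch, not a drop-in solution: single float32 input/output engine assumed.
import numpy as np
import pycuda.autoinit  # creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("model.engine", "rb") as f:          # placeholder path
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
stream = cuda.Stream()

input_name = engine.get_tensor_name(0)
output_name = engine.get_tensor_name(1)

input_shape = (1, 3, 224, 224)                 # placeholder, use your input shape
context.set_input_shape(input_name, input_shape)
output_shape = tuple(context.get_tensor_shape(output_name))

# Allocate device buffers once and bind them to the tensor names.
d_input = cuda.mem_alloc(int(np.prod(input_shape)) * np.dtype(np.float32).itemsize)
d_output = cuda.mem_alloc(int(np.prod(output_shape)) * np.dtype(np.float32).itemsize)
context.set_tensor_address(input_name, int(d_input))
context.set_tensor_address(output_name, int(d_output))

# Per batch: copy in, run, copy out, wait for the stream.
input_data = np.random.randn(*input_shape).astype(np.float32)
output = np.empty(output_shape, dtype=np.float32)
cuda.memcpy_htod_async(d_input, input_data, stream)
context.execute_async_v3(stream_handle=stream.handle)
cuda.memcpy_dtoh_async(output, d_output, stream)
stream.synchronize()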
Hi, how do you copy the results back?
n = output_name
output_shape = self.engine.get_tensor_shape(n)
output_shape = (input_bs, *output_shape[1:])
d_output = int(cuda.mem_alloc(np.random.randn(*output_shape).astype(self.dtype).nbytes))
self.context.set_tensor_address(n, d_output)
output = np.empty(output_shape, dtype=self.dtype)
cuda.memcpy_dtoh_async(output, self.context.get_tensor_address(n), self.stream)
This is what I did, but I keep getting pycuda._driver.LogicError: cuMemcpyDtoHAsync failed: an illegal memory access was encountered at the dtoh line.
As I understand it:

1. You reserve memory space by allocating the input and output buffers (d_input, d_output).
2. Then you connect those buffers to the model's input and output tensors (context.set_tensor_address). Steps 1 and 2 you do just once.
3. After that, you put the input data into the input buffer (cuda.memcpy_htod_async).
4. Run inference (context.execute_async_v3).
5. Then you get the output data from the output buffer (cuda.memcpy_dtoh_async). Steps 3-5 you repeat for every batch during inference (see the sketch below).
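For example, the repeated part (steps 3-5) could look like this; batches is just a placeholder iterable of float32 numpy arrays, and output is a pre-allocated host array:

for input_data in batches:
    cuda.memcpy_htod_async(d_input, input_data, stream)    # step 3: host -> device
    context.execute_async_v3(stream_handle=stream.handle)  # step 4: run inference
    cuda.memcpy_dtoh_async(output, d_output, stream)       # step 5: device -> host
    stream.synchronize()                                    # wait before reading output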
cuda.memcpy_dtoh_async(output, self.context.get_tensor_address(n), self.stream)
In your example you are trying to copy the result from the output tensor address instead of from the output buffer, which is what you have to use.
I see, so I should be using
cuda.memcpy_dtoh_async(output, d_output, self.stream)
instead.
I previously thought that once I had set the tensor address to d_output, I could reuse that address. So they are different.
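For completeness, a sketch of the corrected copy-back under the same assumptions (self.d_output is just an illustrative attribute name; keeping the DeviceAllocation returned by mem_alloc alive, rather than storing only int(...), also prevents PyCUDA from freeing the buffer when that object is garbage collected):

self.d_output = cuda.mem_alloc(int(np.prod(output_shape)) * np.dtype(self.dtype).itemsize)
self.context.set_tensor_address(n, int(self.d_output))

output = np.empty(output_shape, dtype=self.dtype)
cuda.memcpy_dtoh_async(output, self.d_output, self.stream)  # copy from the buffer you allocated
self.stream.synchronize()                                    # wait for the copy before using output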