Confusion about TensorRT stream.synchronize() in GPU-only inference


Hi, I’m trying to use TensorRT inference in a GPU-only pipeline:

    def create_output_buffers(self, batch_size):

        outputs = [None] * len(self.output_names)
        for i, output_name in enumerate(self.output_names):
            idx = self.engine.get_binding_index(output_name)
            dtype = torch_dtype_from_trt(self.engine.get_binding_dtype(idx))
            if self.final_shapes is not None:
                shape = (batch_size, ) + self.final_shapes[i]
            else:
                # fall back to the shape recorded in the engine binding
                shape = (batch_size, ) + tuple(self.engine.get_binding_shape(idx))
            device = self.torch_device_from_trt(self.engine.get_location(idx))
            output = torch.empty(size=shape, dtype=dtype, device=device)
            outputs[i] = output
        return outputs

    def execute(self, *inputs):

        batch_size = inputs[0].shape[0]

        bindings = [None] * (len(self.input_names) + len(self.output_names))

        # map input bindings
        inputs_torch = [None] * len(self.input_names)
        for i, name in enumerate(self.input_names):
            idx = self.engine.get_binding_index(name)

            inputs_torch[i] = inputs[i].to(self.torch_device_from_trt(self.engine.get_location(idx)))
            inputs_torch[i] = inputs_torch[i].type(torch_dtype_from_trt(self.engine.get_binding_dtype(idx)))

            bindings[idx] = int(inputs_torch[i].data_ptr())

        output_buffers = self.create_output_buffers(batch_size)

        # map output bindings
        for i, name in enumerate(self.output_names):
            idx = self.engine.get_binding_index(name)
            bindings[idx] = int(output_buffers[i].data_ptr())

        context = self.context.get(timeout=10)
        context.execute_async_v2(bindings, torch.cuda.current_stream().cuda_stream)

        outputs = output_buffers
        # torch.cuda.current_stream().synchronize()

        return outputs

As shown above, the input is a torch tensor that is already on the GPU, and the TRT execute function only runs inference, with no D2H or H2D data transfer.

As I understand it, torch.cuda.current_stream().synchronize() holds the CPU until the D2H copy has finished. So I’m confused: is it necessary to call torch.cuda.current_stream().synchronize() here when there is no data transfer between CPU and GPU?
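
To make the question concrete, here is a toy model of how I picture stream ordering. It is only a sketch: ToyStream is a hypothetical stand-in written in plain Python, not a TensorRT or PyTorch API, and it assumes that work enqueued on a single stream runs in FIFO order while the enqueue call itself returns right away (which is my understanding of execute_async_v2 and of ordinary PyTorch CUDA ops).

```python
from collections import deque


class ToyStream:
    """Hypothetical stand-in for a CUDA stream: work is queued, then run in FIFO order."""

    def __init__(self):
        self._pending = deque()

    def enqueue(self, fn):
        # Models an asynchronous launch: returns immediately, fn has not run yet.
        self._pending.append(fn)

    def synchronize(self):
        # Models stream.synchronize(): the host waits until all queued work is done.
        while self._pending:
            self._pending.popleft()()


stream = ToyStream()
buf = {"out": None, "post": None}

# "Inference" enqueued on the stream; the call returns before it runs.
stream.enqueue(lambda: buf.update(out=21))

# A follow-up "GPU op" on the same stream. It is queued behind the inference,
# so it sees the finished output without any synchronize() in between.
stream.enqueue(lambda: buf.update(post=buf["out"] * 2))

host_view_before = buf["post"]  # host reads without waiting: nothing has run
stream.synchronize()
host_view_after = buf["post"]   # host reads after waiting: result is ready

print(host_view_before, host_view_after)  # None 42
```

If this picture is right, later GPU work on the same stream would not need the explicit synchronize(), and only a host-side read of the outputs would. As far as I know, calls such as .cpu() or .item() already wait for the stream themselves. Consuming the outputs from a different stream would be a separate case that this sketch does not cover.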



Hope the following information may help you.

Thank you.