TensorRT waiting after inference seemingly for no reason


I am tring to serve a model using TensorRT. My original model is in TorchScript, I converted it to ONNX and then to TensorRT. When profiling the inference using Nsys Profiler, I get the following output (I infer multiple times consecutively).

When zooming in on to the yellow parts in the TensorRT, I can match the name of the last operation to the last operation before the output in the ONNX model, when viewing it on Netron.

From this I conclude that by the time the last yellow tag is complete, the inference is done. This also correlates to the fact that TRT should be faster than the equivalent TorchScript model (I am benchmarking them here), which currently takes 14ms (only the yellow takes 6ms). And yet, Execute doesn’t return for another 146ms (_trt_infer is my inference function, attached, and enter is. a context manager I added to differ the execution from the other parts of the function).

Is this incredibly long waiting period intentional or am I doing something wrong? Has the inference not really ended yet and the waiting is the inference itself, or some kind of syncing? Or maybe TRT infers lazily and then the yellow is only queueing the operations or something?
I couldn’t find any documentation explaining this peculiar behaviour and I would most appreciate someone explaining to me what is going on, and hopefully removing this undesired behaviour.

Thank you very much in advance,



TensorRT Version:
GPU Type: T4
Nvidia Driver Version: 460.73.01
CUDA Version: 11.2
CUDNN Version: 8.2.0
Operating System + Version:
Python Version (if applicable): 3.8.13
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.12.1
Baremetal or Container (if container which image + tag):

Relevant Files

session initialization code:

runtime = trt.Runtime(trt.Logger(trt.Logger.VERBOSE))
with open(trt_path, "rb") as f:
    serialized_engine = f.read()
    engine = runtime.deserialize_cuda_engine(serialized_engine)
    trt_context = engine.create_execution_context()

_trt_infer code:

def nvtx_range(msg):
    depth = torch.cuda.nvtx.range_push(msg)
        yield depth

def _trt_infer(b: Tuple[np.ndarray]) -> torch.Tensor:
      """Runs inference on TRT engine

      :param b: np array of shape (I, B, ...) where ... represents dimensions of single input from a single input source
      torch_b = tuple(torch.from_numpy(i).to("cuda") for i in b)
      input_idx = engine["INPUT0"]
      output_idx = engine["OUTPUT0"]
      buffers = [None] * 2
      trt_out = torch.empty(
          (torch_b[0].shape[0], 3, 1024, 1024), dtype=torch_b[0].dtype, device="cuda"
      buffers[output_idx] = trt_out.data_ptr()
      buffers[input_idx] = torch_b[0].data_ptr()
      if not trt_context.set_binding_shape(input_idx, torch_b[0].shape):
          print("Failed to set binding shape")

      with nvtx_range("EXECV2"):
          success = trt_context.execute_v2(buffers)
      if not success:
          print("Failed to execute")
      return trt_out

profiling command:
nsys profile --cudabacktrace=true --cuda-memory-usage true -w true -t cuda,nvtx,osrt,cudnn,cublas -s none -o report${i} -f true -x true python infer.py

Please refer to below links related custom plugin implementation and sample:

While IPluginV2 and IPluginV2Ext interfaces are still supported for backward compatibility with TensorRT 5.1 and 6.0.x respectively, however, we recommend that you write new plugins or refactor existing ones to target the IPluginV2DynamicExt or IPluginV2IOExt interfaces instead.


I may have not been clear with my question but this seems to be not very relevant to my question.
I want to understand what’s happening in the dead space after the yellow tags and before the orange Execute tag ends.
This inference is done on the TRT model, not the ONNX model. Furthermore, this model has no custom layers, nor does it implement any custom plugin.


Could you please share with us the issue repro ONNX model, TRT engine building steps you followed, and inference script to try from our end for better debugging.

Thank you.

I can’t upload my TRT file as it is too big but it was generated using the following command:
trtexec --minShapes=INPUT0:1x18x512 --optShapes=INPUT0:2x18x512 --maxShapes=INPUT0:4x18x512 --onnx=model.onnx --saveEngine=model.engine
Note that I used TensorRT for all of this (unlike what I wrote above).
This is my ONNX model model.onnx.zip (91.6 MB).

I sent my session initialization script above and my inference script is _trt_infer as above.
And I use it like this:

for _ in range(3):
    print(self._trt_infer([np.empty((2, 18, 512))]).detach().cpu().numpy().shape)

And get the following result (I don’t pay attention to the first one as it is essentially warmup).

You can see again the yellow bars are only a portion of the entire execution time.

Thank you!