TensorRT waiting after inference seemingly for no reason

Description

I am trying to serve a model using TensorRT. My original model is in TorchScript; I converted it to ONNX and then to TensorRT. When profiling the inference with the Nsight Systems (nsys) profiler, I get the following output (I run inference multiple times consecutively).

When zooming in on the yellow parts in the TensorRT row, I can match the name of the last operation to the last operation before the output in the ONNX model when viewing it in Netron.

From this I conclude that by the time the last yellow tag completes, the inference is done. This is also consistent with TRT being expected to be faster than the equivalent TorchScript model (which I am benchmarking against here, and which currently takes 14 ms, whereas the yellow alone takes only 6 ms). And yet, Execute doesn’t return for another 146 ms (_trt_infer is my inference function, attached below, and nvtx_range is a context manager I added to separate the execution from the other parts of the function).

Is this incredibly long waiting period intentional, or am I doing something wrong? Has the inference not really ended yet, and the waiting is the inference itself, or some kind of syncing? Or maybe TRT infers lazily, and the yellow tags only queue the operations?
I couldn’t find any documentation explaining this peculiar behaviour, and I would greatly appreciate an explanation of what is going on, and hopefully a way to remove this undesired behaviour.
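(One quick way to test the lazy-queueing hypothesis, as a minimal sketch reusing the trt_context and buffers from the snippets below: time the execute call itself and then a full device synchronization. If execution were merely queued, the synchronize would absorb the remaining time; if the execute call is synchronous, the synchronize would return almost immediately.)

import time

import torch

t0 = time.perf_counter()
success = trt_context.execute_v2(buffers)  # does this block until the GPU work finishes?
t1 = time.perf_counter()
torch.cuda.synchronize()                   # flush any GPU work that is still queued
t2 = time.perf_counter()
print(f"execute_v2: {(t1 - t0) * 1e3:.1f} ms, extra sync: {(t2 - t1) * 1e3:.1f} ms")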

Thank you very much in advance,

N

Environment

TensorRT Version: 8.4.2.4
GPU Type: T4
Nvidia Driver Version: 460.73.01
CUDA Version: 11.2
CUDNN Version: 8.2.0
Operating System + Version:
Python Version (if applicable): 3.8.13
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.12.1
Baremetal or Container (if container which image + tag):

Relevant Files

session initialization code:

import tensorrt as trt

# Deserialize the prebuilt engine and create an execution context.
runtime = trt.Runtime(trt.Logger(trt.Logger.VERBOSE))
with open(trt_path, "rb") as f:
    serialized_engine = f.read()
    engine = runtime.deserialize_cuda_engine(serialized_engine)
    trt_context = engine.create_execution_context()

_trt_infer code:

from contextlib import contextmanager
from typing import Tuple

import numpy as np
import torch

@contextmanager
def nvtx_range(msg):
    # Wrap a block in an NVTX range so it shows up in the Nsight timeline.
    depth = torch.cuda.nvtx.range_push(msg)
    try:
        yield depth
    finally:
        torch.cuda.nvtx.range_pop()

def _trt_infer(b: Tuple[np.ndarray]) -> torch.Tensor:
    """Runs inference on the TRT engine.

    :param b: np arrays of shape (I, B, ...) where ... represents the dimensions
        of a single input from a single input source
    """
    torch_b = tuple(torch.from_numpy(i).to("cuda") for i in b)
    input_idx = engine["INPUT0"]
    output_idx = engine["OUTPUT0"]
    buffers = [None] * 2
    trt_out = torch.empty(
        (torch_b[0].shape[0], 3, 1024, 1024), dtype=torch_b[0].dtype, device="cuda"
    )
    buffers[output_idx] = trt_out.data_ptr()
    buffers[input_idx] = torch_b[0].data_ptr()
    if not trt_context.set_binding_shape(input_idx, torch_b[0].shape):
        print("Failed to set binding shape")

    with nvtx_range("EXECV2"):
        success = trt_context.execute_v2(buffers)
    if not success:
        print("Failed to execute")
    return trt_out

profiling command:
nsys profile --cudabacktrace=true --cuda-memory-usage true -w true -t cuda,nvtx,osrt,cudnn,cublas -s none -o report${i} -f true -x true python infer.py

Hi,
Please refer to the links below related to custom plugin implementation, and the sample:

While the IPluginV2 and IPluginV2Ext interfaces are still supported for backward compatibility with TensorRT 5.1 and 6.0.x respectively, we recommend that you write new plugins or refactor existing ones to target the IPluginV2DynamicExt or IPluginV2IOExt interfaces instead.

Thanks!

I may not have been clear with my question, but this does not seem very relevant to it.
I want to understand what’s happening in the dead space after the yellow tags end and before the orange Execute tag ends.
This inference is done on the TRT model, not the ONNX model. Furthermore, the model has no custom layers, nor does it implement any custom plugin.

Hi,

Could you please share the ONNX model that reproduces the issue, the TRT engine-building steps you followed, and the inference script, so that we can try it on our end for better debugging.

Thank you.

I can’t upload my TRT file as it is too big, but it was generated using the following command:
trtexec --minShapes=INPUT0:1x18x512 --optShapes=INPUT0:2x18x512 --maxShapes=INPUT0:4x18x512 --onnx=model.onnx --saveEngine=model.engine
Note that I used TensorRT 8.3.4.1 for all of this (unlike what I wrote above).
This is my ONNX model model.onnx.zip (91.6 MB).
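(Side note: the engine’s standalone latency can also be cross-checked with trtexec itself, independent of any Python harness; trtexec reports GPU compute time and end-to-end latency statistics. A minimal invocation, assuming the engine file produced by the command above, with the shape mirroring the optShapes:)

trtexec --loadEngine=model.engine --shapes=INPUT0:2x18x512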

I sent my session initialization script above and my inference script is _trt_infer as above.
And I use it like this:

for _ in range(3):
    print(_trt_infer((np.empty((2, 18, 512), dtype=np.float32),)).detach().cpu().numpy().shape)

And I get the following result (I ignore the first run, as it is essentially warmup).

You can see again the yellow bars are only a portion of the entire execution time.

Thank you!

Hi,

The execute() call is synchronous, meaning that it only returns when the GPU job is done.
It seems you’re looking at the CPU activities, so it looks like it’s just waiting.
Please expand the “26 streams hidden” entry under the “CUDA HW” row to see the GPU activities.
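(For contrast, TensorRT 8.x also exposes an asynchronous entry point, execute_async_v2, which enqueues the work on a caller-supplied CUDA stream and returns immediately. A minimal sketch, reusing the buffers from the inference snippet above and PyTorch’s stream wrapper:)

stream = torch.cuda.Stream()                                # user-managed CUDA stream
trt_context.execute_async_v2(buffers, stream.cuda_stream)   # enqueues the work and returns immediately
# ...the CPU is free to do other work here...
stream.synchronize()                                        # block only when the output is actually needed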

Thank you.

All 26 of the hidden streams are completely empty; there is nothing in them.
Are you saying that the GPU work finishes after this very short amount of time, and that afterwards only CPU work is done? If so, what CPU work could that be? The entire model is on the GPU, and the memcpy is only the small red part at the end, so we’re still left with an unexplained portion in the middle (the streamSync). Do you know what it is?

Thank you!

Hi,

Could you please share with us the Nsys profiles?

Thank you.

OK, this is very weird. I accidentally deleted the previous .qdrep file, so I tried to reproduce the issue with the exact same code, and now I get a lot of info in CUDA HW. Plus, the exported file is now .nsys-rep, so I suppose Nsight auto-updated and the issue was solved?
Either way, this is my Nsys report:
report.nsys-rep (1.2 MB)
Note that there is some code running beforehand which is not interesting; only the last bit is (the one you already saw). From what I see here, TRT just defers to a different stream, does everything there, and waits for it to finish, and my problem was that for some reason Nsight wasn’t able to trace that stream?

Sorry for the confusion; I myself am not so sure what’s going on at this point.
Thank you

Please find the following information.

I see, this explains a lot, thank you!
So the TensorRT rows in Nsight can only show which GPU jobs are being launched, not what the GPU is currently doing (although that can be seen in a different place, as you pointed out)?

Yes
