Description
I am trying to serve a model using TensorRT. My original model is in TorchScript; I converted it to ONNX and then to TensorRT. When profiling the inference with the Nsight Systems (nsys) profiler, I get the following output (I run inference multiple times consecutively).
When zooming in on the yellow parts in the TensorRT row, I can match the name of the last operation to the last operation before the output in the ONNX model, when viewing it in Netron.
From this I conclude that by the time the last yellow tag completes, the inference is done. This is also consistent with the expectation that TRT should be faster than the equivalent TorchScript model (which I am benchmarking against here), which currently takes 14 ms, while the yellow part alone takes only 6 ms. And yet, Execute does not return for another 146 ms (_trt_infer is my inference function, attached below, and enter is a context manager I added to separate the execution from the other parts of the function).
Is this incredibly long waiting period intentional, or am I doing something wrong? Has the inference not actually finished yet, so the waiting is the inference itself, or some kind of syncing? Or maybe TRT executes lazily, and the yellow part is only queueing the operations?
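To separate "time until the call returns on the host" from "time until the queued GPU work has actually finished", I put together this small timing helper (a sketch; the launch and synchronize callables are placeholders for whatever call is being measured, e.g. execute_v2 and torch.cuda.synchronize):

```python
import time


def time_host_vs_complete(launch, synchronize):
    """Measure two durations for one GPU call, in milliseconds.

    launch: callable that issues/enqueues the work (e.g. the execute call)
    synchronize: callable that blocks until all queued device work is done
                 (e.g. torch.cuda.synchronize)

    Returns (host_ms, total_ms): time until launch() returned on the host,
    and time until the device was fully idle. A large gap between the two
    suggests the launch only queued work; a small gap suggests the launch
    itself blocked until completion.
    """
    t0 = time.perf_counter()
    launch()
    t_return = time.perf_counter()  # host-side return
    synchronize()
    t_done = time.perf_counter()    # all queued work complete
    return (t_return - t0) * 1e3, (t_done - t0) * 1e3
```

Hypothetical usage against the attached inference function would be something like `time_host_vs_complete(lambda: trt_context.execute_v2(buffers), torch.cuda.synchronize)`.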
I couldn’t find any documentation explaining this peculiar behaviour, and I would greatly appreciate an explanation of what is going on, and hopefully a way to remove this undesired behaviour.
Thank you very much in advance,
N
Environment
TensorRT Version: 8.4.2.4
GPU Type: T4
Nvidia Driver Version: 460.73.01
CUDA Version: 11.2
CUDNN Version: 8.2.0
Operating System + Version:
Python Version (if applicable): 3.8.13
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.12.1
Baremetal or Container (if container which image + tag):
Relevant Files
session initialization code:
runtime = trt.Runtime(trt.Logger(trt.Logger.VERBOSE))
with open(trt_path, "rb") as f:
    serialized_engine = f.read()
engine = runtime.deserialize_cuda_engine(serialized_engine)
trt_context = engine.create_execution_context()
_trt_infer code:
from contextlib import contextmanager
from typing import Tuple

import numpy as np
import torch


@contextmanager
def nvtx_range(msg):
    depth = torch.cuda.nvtx.range_push(msg)
    try:
        yield depth
    finally:
        torch.cuda.nvtx.range_pop()


def _trt_infer(b: Tuple[np.ndarray, ...]) -> torch.Tensor:
    """Runs inference on the TRT engine.

    :param b: np arrays of shape (I, B, ...) where ... represents the
        dimensions of a single input from a single input source
    """
    torch_b = tuple(torch.from_numpy(i).to("cuda") for i in b)
    input_idx = engine["INPUT0"]
    output_idx = engine["OUTPUT0"]
    buffers = [None] * 2
    trt_out = torch.empty(
        (torch_b[0].shape[0], 3, 1024, 1024), dtype=torch_b[0].dtype, device="cuda"
    )
    buffers[output_idx] = trt_out.data_ptr()
    buffers[input_idx] = torch_b[0].data_ptr()
    if not trt_context.set_binding_shape(input_idx, torch_b[0].shape):
        print("Failed to set binding shape")
    with nvtx_range("EXECV2"):
        success = trt_context.execute_v2(buffers)
    if not success:
        print("Failed to execute")
    return trt_out
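For comparison, I also considered the asynchronous API, where the enqueue and the wait are separated explicitly, to see where the host actually blocks. A rough, untested sketch (helper name is mine; it assumes execute_async_v2 and a torch.cuda.Stream, whose cuda_stream attribute is the raw stream handle):

```python
def infer_async(trt_context, buffers, stream):
    """Enqueue inference on a CUDA stream, then wait for it explicitly.

    trt_context: a TensorRT IExecutionContext
    buffers: list of device pointers, ordered by binding index
    stream: a torch.cuda.Stream-like object exposing .cuda_stream and
            .synchronize()

    execute_async_v2 should return as soon as the work is enqueued, so any
    long wait would show up in stream.synchronize() instead.
    """
    ok = trt_context.execute_async_v2(buffers, stream.cuda_stream)
    stream.synchronize()  # host blocks here, not in the enqueue call
    return ok
```

If the 146 ms gap moved from the execute call into synchronize(), that would confirm the wait is device-side work (or syncing) rather than overhead inside Execute itself.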
profiling command:
nsys profile --cudabacktrace=true --cuda-memory-usage true -w true -t cuda,nvtx,osrt,cudnn,cublas -s none -o report${i} -f true -x true python infer.py