Description
I am trying to serve a model using TensorRT. My original model is in TorchScript; I converted it to ONNX and then to TensorRT. When profiling the inference with the Nsight Systems (nsys) profiler, I get the following output (I run inference multiple times consecutively).
When zooming in on the yellow parts in the TensorRT row, I can match the name of the last operation to the last operation before the output in the ONNX model, when viewing it in Netron.
From this I conclude that by the time the last yellow tag completes, the inference is done. This is also consistent with the expectation that TRT should be faster than the equivalent TorchScript model (which I am benchmarking against here), which currently takes 14 ms, while the yellow part alone takes only 6 ms. And yet, Execute does not return for another 146 ms (_trt_infer is my inference function, attached below, and enter is a context manager I added to separate the execution from the other parts of the function).
Is this incredibly long waiting period intentional, or am I doing something wrong? Has the inference not actually finished yet, so the waiting is the inference itself, or some kind of syncing? Or maybe TRT executes lazily, and the yellow part is only queueing the operations?
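To separate "time until the call returns on the host" from "time until the queued GPU work has actually finished", I put together this small timing helper (a sketch; the launch and synchronize callables are placeholders for whatever call is being measured, e.g. execute_v2 and torch.cuda.synchronize):

```python
import time


def time_host_vs_complete(launch, synchronize):
    """Measure two durations for one GPU call, in milliseconds.

    launch: callable that issues/enqueues the work (e.g. the execute call)
    synchronize: callable that blocks until all queued device work is done
                 (e.g. torch.cuda.synchronize)

    Returns (host_ms, total_ms): time until launch() returned on the host,
    and time until the device was fully idle. A large gap between the two
    suggests the launch only queued work; a small gap suggests the launch
    itself blocked until completion.
    """
    t0 = time.perf_counter()
    launch()
    t_return = time.perf_counter()  # host-side return
    synchronize()
    t_done = time.perf_counter()    # all queued work complete
    return (t_return - t0) * 1e3, (t_done - t0) * 1e3
```

Hypothetical usage against the attached inference function would be something like `time_host_vs_complete(lambda: trt_context.execute_v2(buffers), torch.cuda.synchronize)`.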
I couldn’t find any documentation explaining this peculiar behaviour, and I would greatly appreciate an explanation of what is going on, and hopefully a way to remove this undesired behaviour.
Thank you very much in advance,
N
Environment
TensorRT Version: 8.4.2.4
GPU Type: T4
Nvidia Driver Version: 460.73.01
CUDA Version: 11.2
CUDNN Version: 8.2.0
Operating System + Version:
Python Version (if applicable): 3.8.13
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.12.1
Baremetal or Container (if container which image + tag):
Relevant Files
session initialization code:
runtime = trt.Runtime(trt.Logger(trt.Logger.VERBOSE))
with open(trt_path, "rb") as f:
    serialized_engine = f.read()
engine = runtime.deserialize_cuda_engine(serialized_engine)
trt_context = engine.create_execution_context()
_trt_infer code:
from contextlib import contextmanager
from typing import Tuple

import numpy as np
import torch


@contextmanager
def nvtx_range(msg):
    depth = torch.cuda.nvtx.range_push(msg)
    try:
        yield depth
    finally:
        torch.cuda.nvtx.range_pop()


def _trt_infer(b: Tuple[np.ndarray, ...]) -> torch.Tensor:
    """Runs inference on the TRT engine.

    :param b: np arrays of shape (I, B, ...) where ... represents the
        dimensions of a single input from a single input source
    """
    torch_b = tuple(torch.from_numpy(i).to("cuda") for i in b)
    input_idx = engine["INPUT0"]
    output_idx = engine["OUTPUT0"]
    buffers = [None] * 2
    trt_out = torch.empty(
        (torch_b[0].shape[0], 3, 1024, 1024), dtype=torch_b[0].dtype, device="cuda"
    )
    buffers[output_idx] = trt_out.data_ptr()
    buffers[input_idx] = torch_b[0].data_ptr()
    if not trt_context.set_binding_shape(input_idx, torch_b[0].shape):
        print("Failed to set binding shape")
    with nvtx_range("EXECV2"):
        success = trt_context.execute_v2(buffers)
    if not success:
        print("Failed to execute")
    return trt_out
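For comparison, I also considered the asynchronous API, where the enqueue and the wait are separated explicitly, to see where the host actually blocks. A rough, untested sketch (helper name is mine; it assumes execute_async_v2 and a torch.cuda.Stream, whose cuda_stream attribute is the raw stream handle):

```python
def infer_async(trt_context, buffers, stream):
    """Enqueue inference on a CUDA stream, then wait for it explicitly.

    trt_context: a TensorRT IExecutionContext
    buffers: list of device pointers, ordered by binding index
    stream: a torch.cuda.Stream-like object exposing .cuda_stream and
            .synchronize()

    execute_async_v2 should return as soon as the work is enqueued, so any
    long wait would show up in stream.synchronize() instead.
    """
    ok = trt_context.execute_async_v2(buffers, stream.cuda_stream)
    stream.synchronize()  # host blocks here, not in the enqueue call
    return ok
```

If the 146 ms gap moved from the execute call into synchronize(), that would confirm the wait is device-side work (or syncing) rather than overhead inside Execute itself.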
profiling command:
nsys profile --cudabacktrace=true --cuda-memory-usage true -w true -t cuda,nvtx,osrt,cudnn,cublas -s none -o report${i} -f true -x true python infer.py