`nvinfer` automatically converts the original model's format to a TensorRT engine. To achieve the same with `nvinferserver` (Triton server), our application needs to take care of building the TensorRT engine using `tao-converter`, `trtexec`, or some other external binary, depending on the original format.
I have a couple of questions about implementing the TensorRT conversion for ONNX models:
- The `nvinfer` config for ONNX models doesn't require us to pass the input layer name, but that info seems to be required when using `trtexec`. Can the input layer name be omitted in the `trtexec` call, or is there some way we can obtain this information programmatically? (A possible way to do this is sketched at the end of this post.)
- If the conversion to a TensorRT engine is performed with a fixed batch size:

```
trtexec --buildOnly --optShapes=input:8x3x512x896 --onnx=/var/lib/models/triton_model_repo/a82e9df2-4eb1-454a-93c2-8b8fa113b840/1/a82e9df2-4eb1-454a-93c2-8b8fa113b840.onnx --fp16 --saveEngine=/var/lib/models/triton_model_repo/a82e9df2-4eb1-454a-93c2-8b8fa113b840/1/a82e9df2-4eb1-454a-93c2-8b8fa113b840.engine
```
and I deploy a single pipeline with 1 input stream, I get this error:

```
2023-10-11T03:58:48.396812Z INFO triton_server: E1011 03:58:48.396782 176 tensorrt.cc:2130] error setting the binding dimension
ERROR: infer_grpc_client.cpp:427 inference failed with error: request specifies invalid shape for input 'input' for a82e9df2-4eb1-454a-93c2-8b8fa113b840_0. Error details: model expected the shape of dimension 0 to be between 8 and 8 but received 1
ERROR: infer_trtis_backend.cpp:372 failed to specify dims after running inference failed on model:a82e9df2-4eb1-454a-93c2-8b8fa113b840, nvinfer error:NVDSINFER_TRITON_ERROR
2023-10-11T03:58:48.396902Z INFO run{deployment_id=40a75648-1573-4b15-aff3-ef1250686ad6}:run_pipeline_inner: gst_runner::gstreamer_log: nvinferserver[UID 1]: Error in specifyBackendDims() <infer_grpc_context.cpp:164> [UID = 1]: failed to specify input dims triton backend for model:a82e9df2-4eb1-454a-93c2-8b8fa113b840, nvinfer error:NVDSINFER_TRITON_ERROR gst_level=ERROR category=nvinferserver object=model_inference1
```
But when the TRT engine is created with dynamic batch by specifying different `min`, `opt` and `max` values, it works for a single pipeline or for multiple pipelines running in parallel (a sketch of an equivalent build with the TensorRT Python API is at the end of this post):

```
--minShapes=input:1x{input_shape}
--optShapes=input:4x{input_shape}
--maxShapes=input:8x{input_shape}
```
Is it correct to assume that ONNX models with an original fixed batch size of 1 will suffer inference bottlenecks in comparison to models that support dynamic batch when Triton is serving multiple pipelines in parallel? I'm referring to the case where we have to use this shape combination:

```
--minShapes=input:1x{input_shape}
--optShapes=input:1x{input_shape}
--maxShapes=input:1x{input_shape}
```
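
Regarding the input layer name: a possible way to obtain it programmatically is to read it from the ONNX graph itself. Below is a minimal sketch, assuming the `onnx` Python package is available; the model path is just the one from the example above and would be parameterized in practice.

```python
# Sketch: list the input tensor names and shapes of an ONNX model, so they can
# be passed to trtexec (e.g. --optShapes=<name>:8x3x512x896) without hard-coding.
import onnx

model = onnx.load(
    "/var/lib/models/triton_model_repo/a82e9df2-4eb1-454a-93c2-8b8fa113b840/1/"
    "a82e9df2-4eb1-454a-93c2-8b8fa113b840.onnx"
)

# Some exporters also list weight initializers under graph.input, so filter them out.
initializer_names = {init.name for init in model.graph.initializer}
for inp in model.graph.input:
    if inp.name in initializer_names:
        continue
    dims = [d.dim_param or d.dim_value for d in inp.type.tensor_type.shape.dim]
    print(inp.name, dims)  # e.g. input ['batch', 3, 512, 896], or [1, 3, 512, 896] if fixed
```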
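
And for the dynamic-batch build, this is a sketch of roughly what `trtexec --minShapes/--optShapes/--maxShapes` does, written with the TensorRT Python API; the 1/4/8 batch values and the `3x512x896` shape are the ones from the example above, the file paths are placeholders, and it only helps if the exported ONNX actually has a dynamic batch dimension.

```python
# Sketch: build a TensorRT engine with a dynamic-batch optimization profile.
import tensorrt as trt

onnx_path = "model.onnx"      # placeholder: ONNX file in the Triton model repo
engine_path = "model.engine"  # placeholder: where to write the serialized engine

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
# Explicit-batch network (flag required on TensorRT 8.x, the default on newer versions).
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open(onnx_path, "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("failed to parse ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)

# The parsed network already knows its input tensor name, so it does not have to
# be supplied from the outside.
inp = network.get_input(0)
profile = builder.create_optimization_profile()
profile.set_shape(inp.name, (1, 3, 512, 896), (4, 3, 512, 896), (8, 3, 512, 896))  # min, opt, max
config.add_optimization_profile(profile)

serialized_engine = builder.build_serialized_network(network, config)
with open(engine_path, "wb") as f:
    f.write(serialized_engine)
```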