TensorFlow EfficientDet-D0 -> ONNX -> TensorRT converted model fails to run in DeepStream

**• Hardware Platform (Jetson / GPU)**: both
**• DeepStream Version**: 6.1 & 6.0.1

We are trying to use a model originally trained in TensorFlow in a DeepStream pipeline.

The model’s architecture is EfficientDet-D0. To convert it to an ONNX model supported by TensorRT, we used this script:

The conversion from ONNX to a TensorRT engine succeeds, both directly with /usr/src/tensorrt/bin/trtexec and when providing the ONNX model to DeepStream, but nvinfer then fails to load the generated TRT engine.

(Netron screenshot of the model's input and output layers)

nvinfer configuration (some additional values are set as runtime properties; for example, the batch-size used was 1):

[property]
onnx-file=a324eda4-2ed5-4227-b2ea-274ab2ebaf8b.onnx
labelfile-path=a324eda4-2ed5-4227-b2ea-274ab2ebaf8b.labels
num-detected-classes=1
net-scale-factor=1
model-color-format=0
network-mode=0
infer-dims=3;512;512
output-blob-names=num_detections;detection_boxes;detection_scores;detection_classes
cluster-mode=4
network-type=0
parse-bbox-func-name=NvDsInferParseEfficientDetTensorflow_ONNX
custom-lib-path=/opt/lib/objectdetector.so

[class-attrs-all]
pre-cluster-threshold=0.1
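As a cross-check of the configuration above, the names in `output-blob-names` can be compared against the engine's output bindings (a minimal stdlib sketch; the binding names are copied from the trtexec output in this post):

```python
# Cross-check that every name in output-blob-names has a matching engine
# output binding; a mismatch would leave nvinfer unable to locate the layer.
config_outputs = "num_detections;detection_boxes;detection_scores;detection_classes".split(";")

# Output binding names as reported by trtexec for this engine.
engine_bindings = ["num_detections", "detection_boxes", "detection_scores", "detection_classes"]

missing = [name for name in config_outputs if name not in set(engine_bindings)]
print(missing)  # → []
```

In our case the names match, so the failure has to come from something other than a naming mismatch.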

DeepStream logs when deploying the model on a Jetson NX:

from NvDsInferContextImpl::buildModel() <nvdsinfer_context_impl.cpp:1914> [UID = 1]: Trying to create engine from model files
02:35:45 [1389]: 0:02:55.654134423 19508   0x55bdc3e8c0 INFO                 nvinfer gstnvinfer.cpp:638:gst_nvinfer_logger:<model_inference1> NvDsInferContext[UID 1]: Info from NvDsInferContextImpl::buildModel() <nvdsinfer_context_impl.cpp:1947> [UID = 1]: serialize cuda engine to file: /var/lib/models/a324eda4-2ed5-4227-b2ea-274ab2ebaf8b.onnx_b1_gpu0_fp32.engine successfully
02:35:45 [1389]: 0:02:55.723697570 19508   0x55bdc3e8c0 ERROR                nvinfer gstnvinfer.cpp:632:gst_nvinfer_logger:<model_inference1> NvDsInferContext[UID 1]: Error in NvDsInferContextImpl::allocateBuffers() <nvdsinfer_context_impl.cpp:1430> [UID = 1]: Failed to allocate cuda output buffer during context initialization
02:35:45 [1389]: 0:02:55.723811298 19508   0x55bdc3e8c0 ERROR                nvinfer gstnvinfer.cpp:632:gst_nvinfer_logger:<model_inference1> NvDsInferContext[UID 1]: Error in NvDsInferContextImpl::initialize() <nvdsinfer_context_impl.cpp:1280> [UID = 1]: Failed to allocate buffers
02:35:45 [1389]: 0:02:55.804914697 19508   0x55bdc3e8c0 WARN                 nvinfer gstnvinfer.cpp:841:gst_nvinfer_start:<model_inference1> error: Failed to create NvDsInferContext instance

DeepStream logs when deploying the model on a Jetson NX (second attempt):

0:00:00.543412449   629 0x558688b86030 INFO                 nvinfer gstnvinfer.cpp:638:gst_nvinfer_logger:<model_inference1> NvDsInferContext[UID 1]: Info from NvDsInferContextImpl::buildModel() <nvdsinfer_context_impl.cpp:1914> [UID = 1]: Trying to create engine from model files
0:00:55.733019695   629 0x558688b86030 INFO                 nvinfer gstnvinfer.cpp:638:gst_nvinfer_logger:<model_inference1> NvDsInferContext[UID 1]: Info from NvDsInferContextImpl::buildModel() <nvdsinfer_context_impl.cpp:1947> [UID = 1]: serialize cuda engine to file: /var/lib/models/a324eda4-2ed5-4227-b2ea-274ab2ebaf8b.onnx_b1_gpu0_fp32.engine successfully
gst-runner: nvdsinfer_context_impl.cpp:1412: NvDsInferStatus nvdsinfer::NvDsInferContextImpl::allocateBuffers(): Assertion `bindingDims.numElements > 0' failed.

Sharing the output of trtexec:

&&&& RUNNING TensorRT.trtexec [TensorRT v8205] # /usr/src/tensorrt/bin/trtexec --onnx=a324eda4-2ed5-4227-b2ea-274ab2ebaf8b.onnx --saveEngine=a324eda4-2ed5-4227-b2ea-274ab2ebaf8b.onnx_b1_gpu0_fp32.engine --explicitBatch
[08/01/2022-23:09:52] [W] --explicitBatch flag has been deprecated and has no effect!
[08/01/2022-23:09:52] [W] Explicit batch dim is automatically enabled if input model is ONNX or if dynamic shapes are provided when the engine is built.
[08/01/2022-23:09:52] [I] === Model Options ===
[08/01/2022-23:09:52] [I] Format: ONNX
[08/01/2022-23:09:52] [I] Model: a324eda4-2ed5-4227-b2ea-274ab2ebaf8b.onnx
[08/01/2022-23:09:52] [I] Output:
[08/01/2022-23:09:52] [I] === Build Options ===
[08/01/2022-23:09:52] [I] Max batch: explicit batch
[08/01/2022-23:09:52] [I] Workspace: 16 MiB
[08/01/2022-23:09:52] [I] minTiming: 1
[08/01/2022-23:09:52] [I] avgTiming: 8
[08/01/2022-23:09:52] [I] Precision: FP32
[08/01/2022-23:09:52] [I] Calibration: 
[08/01/2022-23:09:52] [I] Refit: Disabled
[08/01/2022-23:09:52] [I] Sparsity: Disabled
[08/01/2022-23:09:52] [I] Safe mode: Disabled
[08/01/2022-23:09:52] [I] DirectIO mode: Disabled
[08/01/2022-23:09:52] [I] Restricted mode: Disabled
[08/01/2022-23:09:52] [I] Save engine: a324eda4-2ed5-4227-b2ea-274ab2ebaf8b.onnx_b1_gpu0_fp32.engine
[08/01/2022-23:09:52] [I] Load engine: 
[08/01/2022-23:09:52] [I] Profiling verbosity: 0
[08/01/2022-23:09:52] [I] Tactic sources: Using default tactic sources
[08/01/2022-23:09:52] [I] timingCacheMode: local
[08/01/2022-23:09:52] [I] timingCacheFile: 
[08/01/2022-23:09:52] [I] Input(s)s format: fp32:CHW
[08/01/2022-23:09:52] [I] Output(s)s format: fp32:CHW
[08/01/2022-23:09:52] [I] Input build shapes: model
[08/01/2022-23:09:52] [I] Input calibration shapes: model
[08/01/2022-23:09:52] [I] === System Options ===
[08/01/2022-23:09:52] [I] Device: 0
[08/01/2022-23:09:52] [I] DLACore: 
[08/01/2022-23:09:52] [I] Plugins:
[08/01/2022-23:09:52] [I] === Inference Options ===
[08/01/2022-23:09:52] [I] Batch: Explicit
[08/01/2022-23:09:52] [I] Input inference shapes: model
[08/01/2022-23:09:52] [I] Iterations: 10
[08/01/2022-23:09:52] [I] Duration: 3s (+ 200ms warm up)
[08/01/2022-23:09:52] [I] Sleep time: 0ms
[08/01/2022-23:09:52] [I] Idle time: 0ms
[08/01/2022-23:09:52] [I] Streams: 1
[08/01/2022-23:09:52] [I] ExposeDMA: Disabled
[08/01/2022-23:09:52] [I] Data transfers: Enabled
[08/01/2022-23:09:52] [I] Spin-wait: Disabled
[08/01/2022-23:09:52] [I] Multithreading: Disabled
[08/01/2022-23:09:52] [I] CUDA Graph: Disabled
[08/01/2022-23:09:52] [I] Separate profiling: Disabled
[08/01/2022-23:09:52] [I] Time Deserialize: Disabled
[08/01/2022-23:09:52] [I] Time Refit: Disabled
[08/01/2022-23:09:52] [I] Skip inference: Disabled
[08/01/2022-23:09:52] [I] Inputs:
[08/01/2022-23:09:52] [I] === Reporting Options ===
[08/01/2022-23:09:52] [I] Verbose: Disabled
[08/01/2022-23:09:52] [I] Averages: 10 inferences
[08/01/2022-23:09:52] [I] Percentile: 99
[08/01/2022-23:09:52] [I] Dump refittable layers:Disabled
[08/01/2022-23:09:52] [I] Dump output: Disabled
[08/01/2022-23:09:52] [I] Profile: Disabled
[08/01/2022-23:09:52] [I] Export timing to JSON file: 
[08/01/2022-23:09:52] [I] Export output to JSON file: 
[08/01/2022-23:09:52] [I] Export profile to JSON file: 
[08/01/2022-23:09:52] [I] 
[08/01/2022-23:09:52] [I] === Device Information ===
[08/01/2022-23:09:52] [I] Selected Device: NVIDIA GeForce GTX 1060 6GB
[08/01/2022-23:09:52] [I] Compute Capability: 6.1
[08/01/2022-23:09:52] [I] SMs: 10
[08/01/2022-23:09:52] [I] Compute Clock Rate: 1.7335 GHz
[08/01/2022-23:09:52] [I] Device Global Memory: 6070 MiB
[08/01/2022-23:09:52] [I] Shared Memory per SM: 96 KiB
[08/01/2022-23:09:52] [I] Memory Bus Width: 192 bits (ECC disabled)
[08/01/2022-23:09:52] [I] Memory Clock Rate: 4.004 GHz
[08/01/2022-23:09:52] [I] 
[08/01/2022-23:09:52] [I] TensorRT version: 8.2.5
[08/01/2022-23:09:53] [I] [TRT] [MemUsageChange] Init CUDA: CPU +193, GPU +0, now: CPU 205, GPU 1542 (MiB)
[08/01/2022-23:09:53] [I] [TRT] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 205 MiB, GPU 1541 MiB
[08/01/2022-23:09:53] [I] [TRT] [MemUsageSnapshot] End constructing builder kernel library: CPU 268 MiB, GPU 1541 MiB
[08/01/2022-23:09:53] [I] Start parsing network model
[08/01/2022-23:09:53] [I] [TRT] ----------------------------------------------------------------
[08/01/2022-23:09:53] [I] [TRT] Input filename:   a324eda4-2ed5-4227-b2ea-274ab2ebaf8b.onnx
[08/01/2022-23:09:53] [I] [TRT] ONNX IR version:  0.0.6
[08/01/2022-23:09:53] [I] [TRT] Opset version:    11
[08/01/2022-23:09:53] [I] [TRT] Producer name:    
[08/01/2022-23:09:53] [I] [TRT] Producer version: 
[08/01/2022-23:09:53] [I] [TRT] Domain:           
[08/01/2022-23:09:53] [I] [TRT] Model version:    0
[08/01/2022-23:09:53] [I] [TRT] Doc string:       
[08/01/2022-23:09:53] [I] [TRT] ----------------------------------------------------------------
[08/01/2022-23:09:53] [W] [TRT] onnx2trt_utils.cpp:366: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_02/1_dn_lvl_6/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,8,8,64,2][NONE] dims(input1)=[1,1,1,2,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_03/1_dn_lvl_5/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,16,16,64,2][NONE] dims(input1)=[1,1,1,2,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_04/1_dn_lvl_4/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,32,32,64,2][NONE] dims(input1)=[1,1,1,2,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_05/1_dn_lvl_3/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,64,64,64,2][NONE] dims(input1)=[1,1,1,2,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_06/1_up_lvl_4/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,32,32,64,3][NONE] dims(input1)=[1,1,1,3,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_07/1_up_lvl_5/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,16,16,64,3][NONE] dims(input1)=[1,1,1,3,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_08/1_up_lvl_6/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,8,8,64,3][NONE] dims(input1)=[1,1,1,3,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_09/1_up_lvl_7/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,4,4,64,2][NONE] dims(input1)=[1,1,1,2,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_10/2_dn_lvl_6/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,8,8,64,2][NONE] dims(input1)=[1,1,1,2,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_11/2_dn_lvl_5/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,16,16,64,2][NONE] dims(input1)=[1,1,1,2,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_12/2_dn_lvl_4/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,32,32,64,2][NONE] dims(input1)=[1,1,1,2,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_13/2_dn_lvl_3/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,64,64,64,2][NONE] dims(input1)=[1,1,1,2,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_14/2_up_lvl_4/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,32,32,64,3][NONE] dims(input1)=[1,1,1,3,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_15/2_up_lvl_5/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,16,16,64,3][NONE] dims(input1)=[1,1,1,3,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_16/2_up_lvl_6/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,8,8,64,3][NONE] dims(input1)=[1,1,1,3,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_17/2_up_lvl_7/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,4,4,64,2][NONE] dims(input1)=[1,1,1,2,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_18/3_dn_lvl_6/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,8,8,64,2][NONE] dims(input1)=[1,1,1,2,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_19/3_dn_lvl_5/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,16,16,64,2][NONE] dims(input1)=[1,1,1,2,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_20/3_dn_lvl_4/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,32,32,64,2][NONE] dims(input1)=[1,1,1,2,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_21/3_dn_lvl_3/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,64,64,64,2][NONE] dims(input1)=[1,1,1,2,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_22/3_up_lvl_4/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,32,32,64,3][NONE] dims(input1)=[1,1,1,3,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_23/3_up_lvl_5/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,16,16,64,3][NONE] dims(input1)=[1,1,1,3,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_24/3_up_lvl_6/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,8,8,64,3][NONE] dims(input1)=[1,1,1,3,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_25/3_up_lvl_7/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,4,4,64,2][NONE] dims(input1)=[1,1,1,2,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] No importer registered for op: BatchedNMS_TRT. Attempting to import as plugin.
[08/01/2022-23:09:53] [I] [TRT] Searching for plugin: BatchedNMS_TRT, plugin_version: 1, plugin_namespace: 
[08/01/2022-23:09:53] [W] [TRT] builtin_op_importers.cpp:4780: Attribute scoreBits not found in plugin node! Ensure that the plugin creator has a default value defined or the engine may fail to build.
[08/01/2022-23:09:53] [I] [TRT] Successfully created plugin: BatchedNMS_TRT
[08/01/2022-23:09:53] [I] Finish parsing network model
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_02/1_dn_lvl_6/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,8,8,64,2][NONE] dims(input1)=[1,1,1,2,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_03/1_dn_lvl_5/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,16,16,64,2][NONE] dims(input1)=[1,1,1,2,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_04/1_dn_lvl_4/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,32,32,64,2][NONE] dims(input1)=[1,1,1,2,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_05/1_dn_lvl_3/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,64,64,64,2][NONE] dims(input1)=[1,1,1,2,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_06/1_up_lvl_4/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,32,32,64,3][NONE] dims(input1)=[1,1,1,3,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_07/1_up_lvl_5/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,16,16,64,3][NONE] dims(input1)=[1,1,1,3,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_08/1_up_lvl_6/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,8,8,64,3][NONE] dims(input1)=[1,1,1,3,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_09/1_up_lvl_7/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,4,4,64,2][NONE] dims(input1)=[1,1,1,2,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_10/2_dn_lvl_6/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,8,8,64,2][NONE] dims(input1)=[1,1,1,2,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_11/2_dn_lvl_5/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,16,16,64,2][NONE] dims(input1)=[1,1,1,2,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_12/2_dn_lvl_4/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,32,32,64,2][NONE] dims(input1)=[1,1,1,2,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_13/2_dn_lvl_3/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,64,64,64,2][NONE] dims(input1)=[1,1,1,2,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_14/2_up_lvl_4/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,32,32,64,3][NONE] dims(input1)=[1,1,1,3,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_15/2_up_lvl_5/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,16,16,64,3][NONE] dims(input1)=[1,1,1,3,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_16/2_up_lvl_6/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,8,8,64,3][NONE] dims(input1)=[1,1,1,3,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_17/2_up_lvl_7/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,4,4,64,2][NONE] dims(input1)=[1,1,1,2,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_18/3_dn_lvl_6/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,8,8,64,2][NONE] dims(input1)=[1,1,1,2,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_19/3_dn_lvl_5/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,16,16,64,2][NONE] dims(input1)=[1,1,1,2,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_20/3_dn_lvl_4/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,32,32,64,2][NONE] dims(input1)=[1,1,1,2,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_21/3_dn_lvl_3/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,64,64,64,2][NONE] dims(input1)=[1,1,1,2,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_22/3_up_lvl_4/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,32,32,64,3][NONE] dims(input1)=[1,1,1,3,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_23/3_up_lvl_5/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,16,16,64,3][NONE] dims(input1)=[1,1,1,3,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_24/3_up_lvl_6/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,8,8,64,3][NONE] dims(input1)=[1,1,1,3,1][NONE].
[08/01/2022-23:09:53] [I] [TRT] StatefulPartitionedCall/EfficientDet-D0/bifpn/node_25/3_up_lvl_7/combine/MatMul: broadcasting input1 to make tensors conform, dims(input0)=[1,4,4,64,2][NONE] dims(input1)=[1,1,1,2,1][NONE].
[08/01/2022-23:09:54] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +270, GPU +112, now: CPU 563, GPU 1819 (MiB)
[08/01/2022-23:09:55] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +112, GPU +46, now: CPU 675, GPU 1865 (MiB)
[08/01/2022-23:09:55] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[08/01/2022-23:10:37] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[08/01/2022-23:12:24] [I] [TRT] Detected 1 inputs and 4 output network tensors.
[08/01/2022-23:12:24] [I] [TRT] Total Host Persistent Memory: 316912
[08/01/2022-23:12:24] [I] [TRT] Total Device Persistent Memory: 13812736
[08/01/2022-23:12:24] [I] [TRT] Total Scratch Memory: 4346624
[08/01/2022-23:12:24] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 12 MiB, GPU 52 MiB
[08/01/2022-23:12:25] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 695.245ms to assign 16 blocks to 616 nodes requiring 54300675 bytes.
[08/01/2022-23:12:25] [I] [TRT] Total Activation Memory: 54300675
[08/01/2022-23:12:25] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1010, GPU 1303 (MiB)
[08/01/2022-23:12:25] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 1011, GPU 1313 (MiB)
[08/01/2022-23:12:25] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +3, GPU +18, now: CPU 3, GPU 18 (MiB)
[08/01/2022-23:12:25] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 1005, GPU 1260 (MiB)
[08/01/2022-23:12:25] [I] [TRT] Loaded engine size: 21 MiB
[08/01/2022-23:12:25] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 1027, GPU 1288 (MiB)
[08/01/2022-23:12:25] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1027, GPU 1296 (MiB)
[08/01/2022-23:12:25] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +17, now: CPU 0, GPU 17 (MiB)
[08/01/2022-23:12:25] [I] Engine built in 152.478 sec.
[08/01/2022-23:12:25] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 923, GPU 1289 (MiB)
[08/01/2022-23:12:25] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 923, GPU 1297 (MiB)
[08/01/2022-23:12:25] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +65, now: CPU 0, GPU 82 (MiB)
[08/01/2022-23:12:25] [I] Using random values for input input_tensor:0
[08/01/2022-23:12:25] [I] Created input binding for input_tensor:0 with dimensions 1x3x512x512
[08/01/2022-23:12:25] [I] Using random values for output num_detections
[08/01/2022-23:12:25] [I] Created output binding for num_detections with dimensions 1
[08/01/2022-23:12:25] [I] Using random values for output detection_boxes
[08/01/2022-23:12:25] [I] Created output binding for detection_boxes with dimensions 1x100x4
[08/01/2022-23:12:25] [I] Using random values for output detection_scores
[08/01/2022-23:12:25] [I] Created output binding for detection_scores with dimensions 1x100
[08/01/2022-23:12:25] [I] Using random values for output detection_classes
[08/01/2022-23:12:25] [I] Created output binding for detection_classes with dimensions 1x100
[08/01/2022-23:12:25] [I] Starting inference
[08/01/2022-23:12:28] [I] Warmup completed 14 queries over 200 ms
[08/01/2022-23:12:28] [I] Timing trace has 226 queries over 3.02941 s
[08/01/2022-23:12:28] [I] 
[08/01/2022-23:12:28] [I] === Trace details ===
[08/01/2022-23:12:28] [I] Trace averages of 10 runs:
[08/01/2022-23:12:28] [I] Average on 10 runs - GPU latency: 13.4645 ms - Host latency: 13.7185 ms (end to end 26.9152 ms, enqueue 8.85314 ms)
[08/01/2022-23:12:28] [I] Average on 10 runs - GPU latency: 13.7576 ms - Host latency: 14.0139 ms (end to end 27.4643 ms, enqueue 8.80132 ms)
[08/01/2022-23:12:28] [I] Average on 10 runs - GPU latency: 13.0679 ms - Host latency: 13.3217 ms (end to end 26.0629 ms, enqueue 8.58351 ms)
[08/01/2022-23:12:28] [I] Average on 10 runs - GPU latency: 13.0766 ms - Host latency: 13.33 ms (end to end 26.0676 ms, enqueue 8.58005 ms)
[08/01/2022-23:12:28] [I] Average on 10 runs - GPU latency: 13.0637 ms - Host latency: 13.3181 ms (end to end 25.8085 ms, enqueue 8.33931 ms)
[08/01/2022-23:12:28] [I] Average on 10 runs - GPU latency: 13.3895 ms - Host latency: 13.6433 ms (end to end 26.6571 ms, enqueue 8.6418 ms)
[08/01/2022-23:12:28] [I] Average on 10 runs - GPU latency: 13.6559 ms - Host latency: 13.9138 ms (end to end 27.0883 ms, enqueue 8.69785 ms)
[08/01/2022-23:12:28] [I] Average on 10 runs - GPU latency: 13.2522 ms - Host latency: 13.5069 ms (end to end 26.4325 ms, enqueue 8.61039 ms)
[08/01/2022-23:12:28] [I] Average on 10 runs - GPU latency: 13.0745 ms - Host latency: 13.3288 ms (end to end 26.0587 ms, enqueue 8.57388 ms)
[08/01/2022-23:12:28] [I] Average on 10 runs - GPU latency: 13.0666 ms - Host latency: 13.3216 ms (end to end 25.9076 ms, enqueue 8.43093 ms)
[08/01/2022-23:12:28] [I] Average on 10 runs - GPU latency: 13.2116 ms - Host latency: 13.4649 ms (end to end 26.3231 ms, enqueue 8.58948 ms)
[08/01/2022-23:12:28] [I] Average on 10 runs - GPU latency: 13.4458 ms - Host latency: 13.7011 ms (end to end 26.748 ms, enqueue 8.81583 ms)
[08/01/2022-23:12:28] [I] Average on 10 runs - GPU latency: 13.4537 ms - Host latency: 13.7085 ms (end to end 26.8575 ms, enqueue 8.77134 ms)
[08/01/2022-23:12:28] [I] Average on 10 runs - GPU latency: 13.233 ms - Host latency: 13.4878 ms (end to end 26.3739 ms, enqueue 8.68862 ms)
[08/01/2022-23:12:28] [I] Average on 10 runs - GPU latency: 13.6176 ms - Host latency: 13.8739 ms (end to end 27.1198 ms, enqueue 8.88249 ms)
[08/01/2022-23:12:28] [I] Average on 10 runs - GPU latency: 13.0697 ms - Host latency: 13.3263 ms (end to end 25.9787 ms, enqueue 8.49829 ms)
[08/01/2022-23:12:28] [I] Average on 10 runs - GPU latency: 13.7578 ms - Host latency: 14.0157 ms (end to end 27.3895 ms, enqueue 8.58496 ms)
[08/01/2022-23:12:28] [I] Average on 10 runs - GPU latency: 13.2208 ms - Host latency: 13.4765 ms (end to end 26.2712 ms, enqueue 8.59319 ms)
[08/01/2022-23:12:28] [I] Average on 10 runs - GPU latency: 13.2895 ms - Host latency: 13.5448 ms (end to end 26.633 ms, enqueue 8.89253 ms)
[08/01/2022-23:12:28] [I] Average on 10 runs - GPU latency: 13.3224 ms - Host latency: 13.5785 ms (end to end 26.4882 ms, enqueue 8.63699 ms)
[08/01/2022-23:12:28] [I] Average on 10 runs - GPU latency: 13.7057 ms - Host latency: 13.9638 ms (end to end 27.4405 ms, enqueue 8.9313 ms)
[08/01/2022-23:12:28] [I] Average on 10 runs - GPU latency: 13.4544 ms - Host latency: 13.7103 ms (end to end 26.8259 ms, enqueue 8.81165 ms)
[08/01/2022-23:12:28] [I] 
[08/01/2022-23:12:28] [I] === Performance summary ===
[08/01/2022-23:12:28] [I] Throughput: 74.6021 qps
[08/01/2022-23:12:28] [I] Latency: min = 13.2803 ms, max = 17.0567 ms, mean = 13.5957 ms, median = 13.3355 ms, percentile(99%) = 15.1699 ms
[08/01/2022-23:12:28] [I] End-to-End Host Latency: min = 25.2949 ms, max = 30.1061 ms, mean = 26.5738 ms, median = 26.1086 ms, percentile(99%) = 29.1558 ms
[08/01/2022-23:12:28] [I] Enqueue Time: min = 7.83862 ms, max = 10.088 ms, mean = 8.671 ms, median = 8.59497 ms, percentile(99%) = 9.83936 ms
[08/01/2022-23:12:28] [I] H2D Latency: min = 0.244385 ms, max = 0.263428 ms, mean = 0.248713 ms, median = 0.247559 ms, percentile(99%) = 0.259155 ms
[08/01/2022-23:12:28] [I] GPU Compute Time: min = 13.0251 ms, max = 16.8042 ms, mean = 13.3404 ms, median = 13.0816 ms, percentile(99%) = 14.9084 ms
[08/01/2022-23:12:28] [I] D2H Latency: min = 0.00488281 ms, max = 0.00952148 ms, mean = 0.00658322 ms, median = 0.00622559 ms, percentile(99%) = 0.00927734 ms
[08/01/2022-23:12:28] [I] Total Host Walltime: 3.02941 s
[08/01/2022-23:12:28] [I] Total GPU Compute Time: 3.01492 s
[08/01/2022-23:12:28] [I] Explanations of the performance metrics are printed in the verbose logs.
[08/01/2022-23:12:28] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8205] # /usr/src/tensorrt/bin/trtexec --onnx=a324eda4-2ed5-4227-b2ea-274ab2ebaf8b.onnx --saveEngine=a324eda4-2ed5-4227-b2ea-274ab2ebaf8b.onnx_b1_gpu0_fp32.engine --explicitBatch

Please share the ONNX file directly.

Hello @Fiona.Chen

Sharing the ONNX file as requested:
2041-7k.onnx (15.9 MB)

Have you checked the output layers of your model?

INFO: …/nvdsinfer/nvdsinfer_model_builder.cpp:610 [Implicit Engine Info]: layers num: 5
0 INPUT kFLOAT input_tensor:0 3x512x512
1 OUTPUT kINT32 num_detections 0
2 OUTPUT kFLOAT detection_boxes 100x4
3 OUTPUT kFLOAT detection_scores 100
4 OUTPUT kFLOAT detection_classes 100

The dimension of the output layer “num_detections” is 0; you need to check your model.
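This matches the allocateBuffers() failures in the logs: nvinfer sizes each output buffer from the product of the binding's dimensions, so a dimension of 0 collapses the element count to 0 and trips the `bindingDims.numElements > 0` assertion. A minimal sketch of that arithmetic, with shapes mirroring the layer listing above (batch dimension already stripped):

```python
from math import prod

# Output shapes as reported in the [Implicit Engine Info] listing;
# num_detections is left with a single dimension of size 0.
bindings = {
    "num_detections": [0],
    "detection_boxes": [100, 4],
    "detection_scores": [100],
    "detection_classes": [100],
}

def num_elements(dims):
    # the per-binding buffer is num_elements * sizeof(dtype) bytes,
    # so a zero element count means nothing can be allocated
    return prod(dims)

bad = [name for name, dims in bindings.items() if num_elements(dims) <= 0]
print(bad)  # → ['num_detections']
```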

The inference worked with the exact same ONNX file when using the TensorRT inference scripts.

But based on your answer, it seems to be the same problem as the one described in this topic: BatchedNMS and BatchedNMSDynamic plugins have different dimensions for num_detections output

Also, in the trtexec logs I shared initially you can see that the num_detections size is 1 there.

Yes. You can use the TensorRT patch in this topic.