@anqliu, as I shared before, the model is successfully converted to a TensorRT engine by nvinfer (please check the polygraphy output in the 1st post). The problem is that nvinfer stopped being able to load the TensorRT engines built from the CustomVision ONNX models when we upgraded the base DeepStream docker images from 6.1.1 to 6.2.

Converting the model with trtexec and pointing DeepStream at the resulting engine via model-engine-file, as you suggested, fails with the same error. The trtexec build output:
root@ds6.2:/var/lib/models# trtexec --onnx=f2fc6fde-cfa3-4948-85a5-667a95d6b281.onnx --saveEngine=f2fc6fde-cfa3-4948-85a5-667a95d6b281.trtexec.engine
&&&& RUNNING TensorRT.trtexec [TensorRT v8502] # trtexec --onnx=f2fc6fde-cfa3-4948-85a5-667a95d6b281.onnx --saveEngine=f2fc6fde-cfa3-4948-85a5-667a95d6b281.trtexec.engine
[10/17/2023-21:53:04] [I] === Model Options ===
[10/17/2023-21:53:04] [I] Format: ONNX
[10/17/2023-21:53:04] [I] Model: f2fc6fde-cfa3-4948-85a5-667a95d6b281.onnx
[10/17/2023-21:53:04] [I] Output:
[10/17/2023-21:53:04] [I] === Build Options ===
[10/17/2023-21:53:04] [I] Max batch: explicit batch
[10/17/2023-21:53:04] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[10/17/2023-21:53:04] [I] minTiming: 1
[10/17/2023-21:53:04] [I] avgTiming: 8
[10/17/2023-21:53:04] [I] Precision: FP32
[10/17/2023-21:53:04] [I] LayerPrecisions:
[10/17/2023-21:53:04] [I] Calibration:
[10/17/2023-21:53:04] [I] Refit: Disabled
[10/17/2023-21:53:04] [I] Sparsity: Disabled
[10/17/2023-21:53:04] [I] Safe mode: Disabled
[10/17/2023-21:53:04] [I] DirectIO mode: Disabled
[10/17/2023-21:53:04] [I] Restricted mode: Disabled
[10/17/2023-21:53:04] [I] Build only: Disabled
[10/17/2023-21:53:04] [I] Save engine: f2fc6fde-cfa3-4948-85a5-667a95d6b281.trtexec.engine
[10/17/2023-21:53:04] [I] Load engine:
[10/17/2023-21:53:04] [I] Profiling verbosity: 0
[10/17/2023-21:53:04] [I] Tactic sources: Using default tactic sources
[10/17/2023-21:53:04] [I] timingCacheMode: local
[10/17/2023-21:53:04] [I] timingCacheFile:
[10/17/2023-21:53:04] [I] Heuristic: Disabled
[10/17/2023-21:53:04] [I] Preview Features: Use default preview flags.
[10/17/2023-21:53:04] [I] Input(s)s format: fp32:CHW
[10/17/2023-21:53:04] [I] Output(s)s format: fp32:CHW
[10/17/2023-21:53:04] [I] Input build shapes: model
[10/17/2023-21:53:04] [I] Input calibration shapes: model
[10/17/2023-21:53:04] [I] === System Options ===
[10/17/2023-21:53:04] [I] Device: 0
[10/17/2023-21:53:04] [I] DLACore:
[10/17/2023-21:53:04] [I] Plugins:
[10/17/2023-21:53:04] [I] === Inference Options ===
[10/17/2023-21:53:04] [I] Batch: Explicit
[10/17/2023-21:53:04] [I] Input inference shapes: model
[10/17/2023-21:53:04] [I] Iterations: 10
[10/17/2023-21:53:04] [I] Duration: 3s (+ 200ms warm up)
[10/17/2023-21:53:04] [I] Sleep time: 0ms
[10/17/2023-21:53:04] [I] Idle time: 0ms
[10/17/2023-21:53:04] [I] Streams: 1
[10/17/2023-21:53:04] [I] ExposeDMA: Disabled
[10/17/2023-21:53:04] [I] Data transfers: Enabled
[10/17/2023-21:53:04] [I] Spin-wait: Disabled
[10/17/2023-21:53:04] [I] Multithreading: Disabled
[10/17/2023-21:53:04] [I] CUDA Graph: Disabled
[10/17/2023-21:53:04] [I] Separate profiling: Disabled
[10/17/2023-21:53:04] [I] Time Deserialize: Disabled
[10/17/2023-21:53:04] [I] Time Refit: Disabled
[10/17/2023-21:53:04] [I] NVTX verbosity: 0
[10/17/2023-21:53:04] [I] Persistent Cache Ratio: 0
[10/17/2023-21:53:04] [I] Inputs:
[10/17/2023-21:53:04] [I] === Reporting Options ===
[10/17/2023-21:53:04] [I] Verbose: Disabled
[10/17/2023-21:53:04] [I] Averages: 10 inferences
[10/17/2023-21:53:04] [I] Percentiles: 90,95,99
[10/17/2023-21:53:04] [I] Dump refittable layers:Disabled
[10/17/2023-21:53:04] [I] Dump output: Disabled
[10/17/2023-21:53:04] [I] Profile: Disabled
[10/17/2023-21:53:04] [I] Export timing to JSON file:
[10/17/2023-21:53:04] [I] Export output to JSON file:
[10/17/2023-21:53:04] [I] Export profile to JSON file:
[10/17/2023-21:53:04] [I]
[10/17/2023-21:53:04] [I] === Device Information ===
[10/17/2023-21:53:04] [I] Selected Device: NVIDIA GeForce GTX 1060 6GB
[10/17/2023-21:53:04] [I] Compute Capability: 6.1
[10/17/2023-21:53:04] [I] SMs: 10
[10/17/2023-21:53:04] [I] Compute Clock Rate: 1.7335 GHz
[10/17/2023-21:53:04] [I] Device Global Memory: 6064 MiB
[10/17/2023-21:53:04] [I] Shared Memory per SM: 96 KiB
[10/17/2023-21:53:04] [I] Memory Bus Width: 192 bits (ECC disabled)
[10/17/2023-21:53:04] [I] Memory Clock Rate: 4.004 GHz
[10/17/2023-21:53:04] [I]
[10/17/2023-21:53:04] [I] TensorRT version: 8.5.2
[10/17/2023-21:53:04] [I] [TRT] [MemUsageChange] Init CUDA: CPU +9, GPU +0, now: CPU 22, GPU 1128 (MiB)
[10/17/2023-21:53:05] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +122, GPU +22, now: CPU 199, GPU 1151 (MiB)
[10/17/2023-21:53:05] [I] Start parsing network model
[10/17/2023-21:53:05] [I] [TRT] ----------------------------------------------------------------
[10/17/2023-21:53:05] [I] [TRT] Input filename: f2fc6fde-cfa3-4948-85a5-667a95d6b281.onnx
[10/17/2023-21:53:05] [I] [TRT] ONNX IR version: 0.0.4
[10/17/2023-21:53:05] [I] [TRT] Opset version: 10
[10/17/2023-21:53:05] [I] [TRT] Producer name: customvision
[10/17/2023-21:53:05] [I] [TRT] Producer version:
[10/17/2023-21:53:05] [I] [TRT] Domain:
[10/17/2023-21:53:05] [I] [TRT] Model version: 0
[10/17/2023-21:53:05] [I] [TRT] Doc string:
[10/17/2023-21:53:05] [I] [TRT] ----------------------------------------------------------------
[10/17/2023-21:53:05] [W] [TRT] onnx2trt_utils.cpp:377: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[10/17/2023-21:53:05] [W] [TRT] Tensor DataType is determined at build time for tensors not marked as input or output.
[10/17/2023-21:53:05] [I] Finish parsing network model
[10/17/2023-21:53:05] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +6, GPU +12, now: CPU 218, GPU 1162 (MiB)
[10/17/2023-21:53:05] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 219, GPU 1172 (MiB)
[10/17/2023-21:53:05] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[10/17/2023-21:55:28] [I] [TRT] Total Activation Memory: 6422669824
[10/17/2023-21:55:28] [I] [TRT] Detected 1 inputs and 3 output network tensors.
[10/17/2023-21:55:28] [I] [TRT] Total Host Persistent Memory: 178960
[10/17/2023-21:55:28] [I] [TRT] Total Device Persistent Memory: 863744
[10/17/2023-21:55:28] [I] [TRT] Total Scratch Memory: 102720
[10/17/2023-21:55:28] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 1 MiB, GPU 905 MiB
[10/17/2023-21:55:28] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 129 steps to complete.
[10/17/2023-21:55:28] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 4.53997ms to assign 8 blocks to 129 nodes requiring 14770688 bytes.
[10/17/2023-21:55:28] [I] [TRT] Total Activation Memory: 14770688
[10/17/2023-21:55:28] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 393, GPU 1185 (MiB)
[10/17/2023-21:55:28] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +11, now: CPU 0, GPU 11 (MiB)
[10/17/2023-21:55:28] [I] Engine built in 144.201 sec.
[10/17/2023-21:55:28] [I] [TRT] Loaded engine size: 11 MiB
[10/17/2023-21:55:28] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 270, GPU 1156 (MiB)
[10/17/2023-21:55:28] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +11, now: CPU 0, GPU 11 (MiB)
[10/17/2023-21:55:28] [I] Engine deserialized in 0.00559827 sec.
[10/17/2023-21:55:28] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 271, GPU 1156 (MiB)
[10/17/2023-21:55:28] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +15, now: CPU 0, GPU 26 (MiB)
[10/17/2023-21:55:28] [I] Setting persistentCacheLimit to 0 bytes.
[10/17/2023-21:55:28] [I] Using random values for input image_tensor
[10/17/2023-21:55:28] [I] Created input binding for image_tensor with dimensions 1x3x320x320
[10/17/2023-21:55:28] [I] Using random values for output detected_boxes
[10/17/2023-21:55:28] [I] Created output binding for detected_boxes with dimensions 1x-1x4
[10/17/2023-21:55:28] [I] Using random values for output detected_classes
[10/17/2023-21:55:28] [I] Created output binding for detected_classes with dimensions 1x-1
[10/17/2023-21:55:28] [I] Using random values for output detected_scores
[10/17/2023-21:55:28] [I] Created output binding for detected_scores with dimensions 1x-1
[10/17/2023-21:55:28] [I] Starting inference
[10/17/2023-21:55:31] [I] Warmup completed 79 queries over 200 ms
[10/17/2023-21:55:31] [I] Timing trace has 1229 queries over 3.0043 s
[10/17/2023-21:55:31] [I]
[10/17/2023-21:55:31] [I] === Trace details ===
[10/17/2023-21:55:31] [I] Trace averages of 10 runs:
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31973 ms - Host latency: 2.42259 ms (enqueue 2.41859 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.32186 ms - Host latency: 2.42467 ms (enqueue 2.42161 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.3131 ms - Host latency: 2.41595 ms (enqueue 2.41131 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31678 ms - Host latency: 2.41966 ms (enqueue 2.41548 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31763 ms - Host latency: 2.42049 ms (enqueue 2.41635 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31451 ms - Host latency: 2.41735 ms (enqueue 2.41342 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31463 ms - Host latency: 2.41749 ms (enqueue 2.41418 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31701 ms - Host latency: 2.41986 ms (enqueue 2.41613 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31414 ms - Host latency: 2.41703 ms (enqueue 2.41413 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31829 ms - Host latency: 2.42116 ms (enqueue 2.41721 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.3135 ms - Host latency: 2.41638 ms (enqueue 2.41246 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.32056 ms - Host latency: 2.42345 ms (enqueue 2.41922 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31795 ms - Host latency: 2.42081 ms (enqueue 2.41721 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31518 ms - Host latency: 2.41808 ms (enqueue 2.414 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31739 ms - Host latency: 2.42026 ms (enqueue 2.41559 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.47253 ms - Host latency: 2.57542 ms (enqueue 2.57201 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.49191 ms - Host latency: 2.59456 ms (enqueue 2.58906 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.46044 ms - Host latency: 2.56313 ms (enqueue 2.56777 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31829 ms - Host latency: 2.42118 ms (enqueue 2.41783 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.7308 ms - Host latency: 2.83388 ms (enqueue 2.89098 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.45786 ms - Host latency: 2.56044 ms (enqueue 2.55618 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.32646 ms - Host latency: 2.42935 ms (enqueue 2.42617 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31478 ms - Host latency: 2.41767 ms (enqueue 2.41423 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31757 ms - Host latency: 2.42041 ms (enqueue 2.41624 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31522 ms - Host latency: 2.4181 ms (enqueue 2.41348 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31631 ms - Host latency: 2.41916 ms (enqueue 2.4152 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31368 ms - Host latency: 2.41663 ms (enqueue 2.41191 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31979 ms - Host latency: 2.42268 ms (enqueue 2.41824 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31368 ms - Host latency: 2.41667 ms (enqueue 2.4126 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31882 ms - Host latency: 2.42197 ms (enqueue 2.41763 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31761 ms - Host latency: 2.42047 ms (enqueue 2.41586 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31475 ms - Host latency: 2.41752 ms (enqueue 2.41432 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31881 ms - Host latency: 2.42152 ms (enqueue 2.41708 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.3171 ms - Host latency: 2.42001 ms (enqueue 2.41503 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31754 ms - Host latency: 2.42045 ms (enqueue 2.4153 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31796 ms - Host latency: 2.42086 ms (enqueue 2.41637 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31564 ms - Host latency: 2.41842 ms (enqueue 2.41448 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31343 ms - Host latency: 2.41617 ms (enqueue 2.41155 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31865 ms - Host latency: 2.42137 ms (enqueue 2.41847 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31715 ms - Host latency: 2.41998 ms (enqueue 2.4165 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.50616 ms - Host latency: 2.60905 ms (enqueue 2.60515 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31615 ms - Host latency: 2.41895 ms (enqueue 2.41543 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31505 ms - Host latency: 2.41794 ms (enqueue 2.41404 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31499 ms - Host latency: 2.4178 ms (enqueue 2.41389 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.3187 ms - Host latency: 2.42153 ms (enqueue 2.41807 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31826 ms - Host latency: 2.42114 ms (enqueue 2.41752 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31298 ms - Host latency: 2.41581 ms (enqueue 2.41277 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31648 ms - Host latency: 2.41921 ms (enqueue 2.4149 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31952 ms - Host latency: 2.42233 ms (enqueue 2.41819 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.3187 ms - Host latency: 2.42147 ms (enqueue 2.41781 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31492 ms - Host latency: 2.41787 ms (enqueue 2.41442 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31595 ms - Host latency: 2.41881 ms (enqueue 2.41509 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31737 ms - Host latency: 2.42006 ms (enqueue 2.41654 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31732 ms - Host latency: 2.42002 ms (enqueue 2.41702 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.3139 ms - Host latency: 2.41681 ms (enqueue 2.41322 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31633 ms - Host latency: 2.41903 ms (enqueue 2.41432 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31753 ms - Host latency: 2.42031 ms (enqueue 2.41647 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31284 ms - Host latency: 2.41566 ms (enqueue 2.41185 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31868 ms - Host latency: 2.42159 ms (enqueue 2.41747 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31611 ms - Host latency: 2.41891 ms (enqueue 2.4155 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31678 ms - Host latency: 2.41949 ms (enqueue 2.41525 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31664 ms - Host latency: 2.41965 ms (enqueue 2.41554 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31506 ms - Host latency: 2.41786 ms (enqueue 2.41386 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31393 ms - Host latency: 2.4168 ms (enqueue 2.41304 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.43423 ms - Host latency: 2.5371 ms (enqueue 2.53331 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.38359 ms - Host latency: 2.4864 ms (enqueue 2.48402 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31864 ms - Host latency: 2.42151 ms (enqueue 2.4179 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31628 ms - Host latency: 2.41902 ms (enqueue 2.41582 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31569 ms - Host latency: 2.41862 ms (enqueue 2.41484 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31718 ms - Host latency: 2.41997 ms (enqueue 2.41484 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31787 ms - Host latency: 2.42064 ms (enqueue 2.41671 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31422 ms - Host latency: 2.41704 ms (enqueue 2.41296 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31641 ms - Host latency: 2.41924 ms (enqueue 2.41511 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31768 ms - Host latency: 2.42047 ms (enqueue 2.41626 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.3171 ms - Host latency: 2.41989 ms (enqueue 2.41616 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31873 ms - Host latency: 2.42158 ms (enqueue 2.41798 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31558 ms - Host latency: 2.41829 ms (enqueue 2.41484 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31929 ms - Host latency: 2.42217 ms (enqueue 2.41897 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31741 ms - Host latency: 2.42014 ms (enqueue 2.41704 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31316 ms - Host latency: 2.41592 ms (enqueue 2.41228 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31641 ms - Host latency: 2.41914 ms (enqueue 2.41565 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31589 ms - Host latency: 2.4186 ms (enqueue 2.41499 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31604 ms - Host latency: 2.4188 ms (enqueue 2.41543 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31494 ms - Host latency: 2.41777 ms (enqueue 2.41418 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31682 ms - Host latency: 2.41965 ms (enqueue 2.41621 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31916 ms - Host latency: 2.422 ms (enqueue 2.41804 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31489 ms - Host latency: 2.41763 ms (enqueue 2.41438 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31648 ms - Host latency: 2.41921 ms (enqueue 2.41594 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31616 ms - Host latency: 2.41895 ms (enqueue 2.41465 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.40559 ms - Host latency: 2.50862 ms (enqueue 2.49448 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.32893 ms - Host latency: 2.43162 ms (enqueue 2.42742 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31572 ms - Host latency: 2.41863 ms (enqueue 2.41384 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.32039 ms - Host latency: 2.42319 ms (enqueue 2.41858 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.3176 ms - Host latency: 2.42031 ms (enqueue 2.41619 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31453 ms - Host latency: 2.41738 ms (enqueue 2.41306 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.3187 ms - Host latency: 2.42146 ms (enqueue 2.41787 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31863 ms - Host latency: 2.42163 ms (enqueue 2.41765 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31965 ms - Host latency: 2.42263 ms (enqueue 2.4186 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31572 ms - Host latency: 2.41863 ms (enqueue 2.41423 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31626 ms - Host latency: 2.41897 ms (enqueue 2.41545 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31804 ms - Host latency: 2.42097 ms (enqueue 2.41646 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31738 ms - Host latency: 2.42034 ms (enqueue 2.41672 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31699 ms - Host latency: 2.4198 ms (enqueue 2.41638 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31255 ms - Host latency: 2.41577 ms (enqueue 2.41123 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31013 ms - Host latency: 2.41296 ms (enqueue 2.40928 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31846 ms - Host latency: 2.42109 ms (enqueue 2.41726 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31631 ms - Host latency: 2.41912 ms (enqueue 2.41611 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31655 ms - Host latency: 2.41934 ms (enqueue 2.41616 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31331 ms - Host latency: 2.41609 ms (enqueue 2.41235 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31274 ms - Host latency: 2.4156 ms (enqueue 2.41135 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.32024 ms - Host latency: 2.42302 ms (enqueue 2.41902 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.3176 ms - Host latency: 2.42031 ms (enqueue 2.4166 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31653 ms - Host latency: 2.41936 ms (enqueue 2.41516 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31455 ms - Host latency: 2.41721 ms (enqueue 2.41426 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.41228 ms - Host latency: 2.51489 ms (enqueue 2.51589 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31638 ms - Host latency: 2.41921 ms (enqueue 2.41531 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31597 ms - Host latency: 2.41873 ms (enqueue 2.41497 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31682 ms - Host latency: 2.41963 ms (enqueue 2.41614 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31777 ms - Host latency: 2.42063 ms (enqueue 2.41709 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31672 ms - Host latency: 2.41951 ms (enqueue 2.41514 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31475 ms - Host latency: 2.41755 ms (enqueue 2.41377 ms)
[10/17/2023-21:55:31] [I] Average on 10 runs - GPU latency: 2.31409 ms - Host latency: 2.41675 ms (enqueue 2.41296 ms)
[10/17/2023-21:55:31] [I]
[10/17/2023-21:55:31] [I] === Performance summary ===
[10/17/2023-21:55:31] [I] Throughput: 409.08 qps
[10/17/2023-21:55:31] [I] Latency: min = 2.40259 ms, max = 4.48938 ms, mean = 2.43246 ms, median = 2.41895 ms, percentile(90%) = 2.43066 ms, percentile(95%) = 2.43835 ms, percentile(99%) = 2.83197 ms
[10/17/2023-21:55:31] [I] Enqueue Time: min = 2.39722 ms, max = 4.48743 ms, mean = 2.42911 ms, median = 2.41504 ms, percentile(90%) = 2.42761 ms, percentile(95%) = 2.43509 ms, percentile(99%) = 2.82538 ms
[10/17/2023-21:55:31] [I] H2D Latency: min = 0.0973511 ms, max = 0.0991821 ms, mean = 0.0982423 ms, median = 0.0982666 ms, percentile(90%) = 0.0983887 ms, percentile(95%) = 0.0985107 ms, percentile(99%) = 0.0986328 ms
[10/17/2023-21:55:31] [I] GPU Compute Time: min = 2.2998 ms, max = 4.38562 ms, mean = 2.32963 ms, median = 2.31616 ms, percentile(90%) = 2.32764 ms, percentile(95%) = 2.33569 ms, percentile(99%) = 2.72888 ms
[10/17/2023-21:55:31] [I] D2H Latency: min = 0.00408936 ms, max = 0.00805664 ms, mean = 0.00458358 ms, median = 0.0045166 ms, percentile(90%) = 0.00476074 ms, percentile(95%) = 0.00488281 ms, percentile(99%) = 0.00488281 ms
[10/17/2023-21:55:31] [I] Total Host Walltime: 3.0043 s
[10/17/2023-21:55:31] [I] Total GPU Compute Time: 2.86312 s
[10/17/2023-21:55:31] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[10/17/2023-21:55:31] [W] If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[10/17/2023-21:55:31] [W] * GPU compute time is unstable, with coefficient of variance = 4.44933%.
[10/17/2023-21:55:31] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[10/17/2023-21:55:31] [I] Explanations of the performance metrics are printed in the verbose logs.
[10/17/2023-21:55:31] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8502] # trtexec --onnx=f2fc6fde-cfa3-4948-85a5-667a95d6b281.onnx --saveEngine=f2fc6fde-cfa3-4948-85a5-667a95d6b281.trtexec.engine
TRT model info:
root@ds6.2:/var/lib/models# polygraphy inspect model f2fc6fde-cfa3-4948-85a5-667a95d6b281.trtexec.engine
[I] Loading bytes from /var/lib/models/f2fc6fde-cfa3-4948-85a5-667a95d6b281.trtexec.engine
[I] ==== TensorRT Engine ====
Name: Unnamed Network 0 | Explicit Batch Engine
---- 1 Engine Input(s) ----
{image_tensor [dtype=float32, shape=(1, 3, 320, 320)]}
---- 3 Engine Output(s) ----
{detected_boxes [dtype=float32, shape=(1, -1, 4)],
detected_classes [dtype=int32, shape=(1, -1)],
detected_scores [dtype=float32, shape=(1, -1)]}
---- Memory ----
Device Memory: 14770688 bytes
---- 1 Profile(s) (4 Tensor(s) Each) ----
- Profile: 0
Tensor: image_tensor (Input), Index: 0 | Shapes: min=(1, 3, 320, 320), opt=(1, 3, 320, 320), max=(1, 3, 320, 320)
Tensor: detected_boxes (Output), Index: 1 | Shape: (1, -1, 4)
Tensor: detected_classes (Output), Index: 2 | Shape: (1, -1)
Tensor: detected_scores (Output), Index: 3 | Shape: (1, -1)
---- 140 Layer(s) ----
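For completeness, the same binding information can be read with the TensorRT Python API inside the container. Below is a minimal sketch (assuming the tensorrt Python bindings are available in the 6.2 image; the engine path is the one used above) that prints the explicit-batch flag and the per-binding shapes and volumes. Note the non-positive volumes for the wildcard outputs, which is exactly what the assertion shown further down trips over.

import tensorrt as trt

# Engine built by trtexec above; path inside the DeepStream 6.2 container.
ENGINE_PATH = "/var/lib/models/f2fc6fde-cfa3-4948-85a5-667a95d6b281.trtexec.engine"

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open(ENGINE_PATH, "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

# Built from an explicit-batch ONNX network, so this reports False.
print("implicit batch dimension:", engine.has_implicit_batch_dimension)

for i in range(engine.num_bindings):
    shape = engine.get_binding_shape(i)
    kind = "input" if engine.binding_is_input(i) else "output"
    # A shape containing -1 has no positive element count; trt.volume() just
    # multiplies the dims, so the wildcard outputs come out negative here.
    print(f"{kind:6s} {engine.get_binding_name(i)}: shape={tuple(shape)} "
          f"volume={trt.volume(shape)}")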
The nvinfer config, hard-coded to load the previously generated TensorRT engine:
root@ds6.2:/var/lib/models# cat f2fc6fde-cfa3-4948-85a5-667a95d6b281.91c0db8e-e0c8-4b89-8ef8-52d3298ea32c.config
[property]
model-engine-file=f2fc6fde-cfa3-4948-85a5-667a95d6b281.trtexec.engine
labelfile-path=f2fc6fde-cfa3-4948-85a5-667a95d6b281.labels
num-detected-classes=1
net-scale-factor=1
model-color-format=0
network-mode=0
infer-dims=3;320;320
output-blob-names=detected_boxes;detected_classes;detected_scores
cluster-mode=4
network-type=0
parse-bbox-func-name=DisableParsing
custom-lib-path=/opt/lib/objectdetector.so
[class-attrs-all]
pre-cluster-threshold=0.1
The output of the DeepStream deployment:
2023-10-17T22:07:45.342680Z INFO run{deployment_id=91c0db8e-e0c8-4b89-8ef8-52d3298ea32c}:run_pipeline_inner: gst_runner::gstreamer_log: NvDsInferContext[UID 1]: Info from NvDsInferContextImpl::deserializeEngineAndBackend() <nvdsinfer_context_impl.cpp:1909> [UID = 1]: deserialized trt engine from :/var/lib/models/f2fc6fde-cfa3-4948-85a5-667a95d6b281.trtexec.engine gst_level=INFO category=nvinfer object=model_inference1
ERROR: ../nvdsinfer/nvdsinfer_model_builder.cpp:512 ImplicitTrtBackend initialize failed because bindings has wildcard dims
2023-10-17T22:07:45.352089Z INFO run{deployment_id=91c0db8e-e0c8-4b89-8ef8-52d3298ea32c}:run_pipeline_inner: gst_runner::gstreamer_log: NvDsInferContext[UID 1]: Info from NvDsInferContextImpl::generateBackendContext() <nvdsinfer_context_impl.cpp:2012> [UID = 1]: Use deserialized engine model: /var/lib/models/f2fc6fde-cfa3-4948-85a5-667a95d6b281.trtexec.engine gst_level=INFO category=nvinfer object=model_inference1
gst-runner: nvdsinfer_context_impl.cpp:1421: NvDsInferStatus nvdsinfer::NvDsInferContextImpl::allocateBuffers(): Assertion `bindingDims.numElements > 0' failed.
So the engine is an explicit-batch engine whose output bindings have wildcard (-1) dims, yet nvinfer initializes the ImplicitTrtBackend and then hits the numElements > 0 assertion in allocateBuffers(). Isn't it a bug that DeepStream tries to load a TensorRT explicit-batch engine with the ImplicitTrtBackend?
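In case it helps anyone trying to reproduce this outside our gst-runner deployment, the same config file should be loadable by a plain nvinfer element along these lines (a hedged, untested sketch; sample.h264 and the nvstreammux resolution are placeholders):

gst-launch-1.0 filesrc location=sample.h264 ! h264parse ! nvv4l2decoder ! mux.sink_0 \
    nvstreammux name=mux batch-size=1 width=1280 height=720 ! \
    nvinfer config-file-path=/var/lib/models/f2fc6fde-cfa3-4948-85a5-667a95d6b281.91c0db8e-e0c8-4b89-8ef8-52d3298ea32c.config ! \
    fakesink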