Cannot serialize ONNX model on TensorRT 8

Description

I have an ONNX model with a TopK layer at the end. The same code serializes this model successfully with TensorRT 7.2.1 on Windows 10/CUDA 11.0, but fails with TensorRT 8 on Ubuntu 20.04/CUDA 11.3 with the following error:

--------------- Timing Runner: TopK_99 (TopK)
Tactic: 0 skipped. Scratch requested: 1643520, available: 1048576
Tactic: 1 skipped. Scratch requested: 1643520, available: 1048576
Tactic: 3 skipped. Scratch requested: 1643520, available: 1048576
Tactic: 2 skipped. Scratch requested: 1643520, available: 1048576
Fastest Tactic: -3360065831133338131 Time: 3.40282e+38
Deleting timing cache: 56 entries, 138 hits
[MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1349, GPU 806 (MiB)
10: [optimizer.cpp::computeCosts::1853] Error Code 10: Internal Error (Could not find any implementation for node TopK_99.)
fish: Job 1, ‘./build_ocr’ terminated by signal SIGSEGV (Address boundary error)
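
For what it is worth, the log shows every TopK tactic requesting about 1.6 MB of scratch memory while only 1 MiB of workspace is available, so the builder is left with no implementation it can schedule for TopK_99. Below is a minimal sketch of a TensorRT 8 C++ serialization path that raises the workspace limit; it uses the standard builder/config/ONNX-parser API, takes the input name and shape from the trtexec command later in this thread, and is only an illustration, not the linked build_ocr.cc:

// build_sketch.cc (illustrative only, not the linked build_ocr.cc)
#include <NvInfer.h>
#include <NvOnnxParser.h>
#include <cstdint>
#include <fstream>
#include <iostream>
#include <memory>

// Minimal logger required by the builder; prints warnings and errors.
class Logger : public nvinfer1::ILogger
{
    void log(Severity severity, const char* msg) noexcept override
    {
        if (severity <= Severity::kWARNING)
            std::cerr << msg << std::endl;
    }
};

int main()
{
    Logger logger;
    auto builder = std::unique_ptr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(logger));
    auto network = std::unique_ptr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(
        1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH)));
    auto parser = std::unique_ptr<nvonnxparser::IParser>(nvonnxparser::createParser(*network, logger));
    if (!parser->parseFromFile("ocr-46k-v2020-9-21.onnx",
                               static_cast<int>(nvinfer1::ILogger::Severity::kWARNING)))
        return 1;

    auto config = std::unique_ptr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());
    // The TopK tactics above ask for ~1.6 MB of scratch but only 1 MiB is
    // available, so give the builder more workspace (256 MiB here).
    config->setMaxWorkspaceSize(256ULL << 20);

    // The trtexec run below uses --shapes=input:32x3x32x256; if the input is
    // dynamic, the builder also needs an optimization profile for it.
    auto* profile = builder->createOptimizationProfile();
    nvinfer1::Dims4 shape{32, 3, 32, 256};
    profile->setDimensions("input", nvinfer1::OptProfileSelector::kMIN, shape);
    profile->setDimensions("input", nvinfer1::OptProfileSelector::kOPT, shape);
    profile->setDimensions("input", nvinfer1::OptProfileSelector::kMAX, shape);
    config->addOptimizationProfile(profile);

    auto serialized = std::unique_ptr<nvinfer1::IHostMemory>(
        builder->buildSerializedNetwork(*network, *config));
    if (!serialized)
        return 1;

    std::ofstream out("ocr.engine", std::ios::binary);
    out.write(static_cast<const char*>(serialized->data()),
              static_cast<std::streamsize>(serialized->size()));
    return 0;
}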

Environment

TensorRT Version: 8
GPU Type: RTX 2080Ti
Nvidia Driver Version: 465.19.01
CUDA Version: 11.3
CUDNN Version: 8.2.0
Operating System + Version: Ubuntu 20.04 kernel 5.8.0
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag): Baremetal

Relevant Files

ONNX file: ocr-46k-v2020-9-21.onnx (Google Drive)
serialize code: build_ocr.cc (Google Drive)

Steps To Reproduce

The code is compiled with:
g++-10 -std=c++20 build_ocr.cc -I /usr/local/cuda-11.3/include/ -L /usr/local/cuda-11.3/lib64/ -o build_ocr -lcudart -lcuda -lnvinfer -lnvonnxparser

Hi,
Could you share the ONNX model and the script, if not shared already, so that we can assist you better?
In the meantime, you can try a few things:

  1. Validate your model with the snippet below.

check_model.py

import sys
import onnx

# Usage: python check_model.py <path-to-model.onnx>
filename = sys.argv[1]
model = onnx.load(filename)
onnx.checker.check_model(model)  # raises an exception if the model is invalid
  2. Try running your model with the trtexec command:
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec
If you are still facing the issue, please share the trtexec --verbose log for further debugging.
Thanks!

I have tried both, and they both work. However, I am using trtexec from the nvcr.io/nvidia/tensorrt:21.05-py3 container, which ships TensorRT 7.2.3. The problem only occurs with TensorRT 8; everything works fine on TensorRT 7.x.
Output of trtexec:

=====================
== NVIDIA TensorRT ==

NVIDIA Release 21.05 (build 22596545)

NVIDIA TensorRT 7.2.3 (c) 2016-2021, NVIDIA CORPORATION. All rights reserved.
Container image (c) 2021, NVIDIA CORPORATION. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

To install Python sample dependencies, run /opt/tensorrt/python/python_setup.sh

To install the open-source samples corresponding to this TensorRT release version run /opt/tensorrt/install_opensource.sh.
To build the open source parsers, plugins, and samples for current top-of-tree on master or a different branch, run /opt/tensorrt/install_opensource.sh -b
See https://github.com/NVIDIA/TensorRT for more information.

&&&& RUNNING TensorRT.trtexec # trtexec --onnx=ocr-46k-v2020-9-21.onnx --shapes=input:32x3x32x256
[05/22/2021-02:31:50] [I] === Model Options ===
[05/22/2021-02:31:50] [I] Format: ONNX
[05/22/2021-02:31:50] [I] Model: ocr-46k-v2020-9-21.onnx
[05/22/2021-02:31:50] [I] Output:
[05/22/2021-02:31:50] [I] === Build Options ===
[05/22/2021-02:31:50] [I] Max batch: explicit
[05/22/2021-02:31:50] [I] Workspace: 16 MiB
[05/22/2021-02:31:50] [I] minTiming: 1
[05/22/2021-02:31:50] [I] avgTiming: 8
[05/22/2021-02:31:50] [I] Precision: FP32
[05/22/2021-02:31:50] [I] Calibration:
[05/22/2021-02:31:50] [I] Refit: Disabled
[05/22/2021-02:31:50] [I] Safe mode: Disabled
[05/22/2021-02:31:50] [I] Save engine:
[05/22/2021-02:31:50] [I] Load engine:
[05/22/2021-02:31:50] [I] Builder Cache: Enabled
[05/22/2021-02:31:50] [I] NVTX verbosity: 0
[05/22/2021-02:31:50] [I] Tactic sources: Using default tactic sources
[05/22/2021-02:31:50] [I] Input(s)s format: fp32:CHW
[05/22/2021-02:31:50] [I] Output(s)s format: fp32:CHW
[05/22/2021-02:31:50] [I] Input build shape: input=32x3x32x256+32x3x32x256+32x3x32x256
[05/22/2021-02:31:50] [I] Input calibration shapes: model
[05/22/2021-02:31:50] [I] === System Options ===
[05/22/2021-02:31:50] [I] Device: 0
[05/22/2021-02:31:50] [I] DLACore:
[05/22/2021-02:31:50] [I] Plugins:
[05/22/2021-02:31:50] [I] === Inference Options ===
[05/22/2021-02:31:50] [I] Batch: Explicit
[05/22/2021-02:31:50] [I] Input inference shape: input=32x3x32x256
[05/22/2021-02:31:50] [I] Iterations: 10
[05/22/2021-02:31:50] [I] Duration: 3s (+ 200ms warm up)
[05/22/2021-02:31:50] [I] Sleep time: 0ms
[05/22/2021-02:31:50] [I] Streams: 1
[05/22/2021-02:31:50] [I] ExposeDMA: Disabled
[05/22/2021-02:31:50] [I] Data transfers: Enabled
[05/22/2021-02:31:50] [I] Spin-wait: Disabled
[05/22/2021-02:31:50] [I] Multithreading: Disabled
[05/22/2021-02:31:50] [I] CUDA Graph: Disabled
[05/22/2021-02:31:50] [I] Separate profiling: Disabled
[05/22/2021-02:31:50] [I] Skip inference: Disabled
[05/22/2021-02:31:50] [I] Inputs:
[05/22/2021-02:31:50] [I] === Reporting Options ===
[05/22/2021-02:31:50] [I] Verbose: Disabled
[05/22/2021-02:31:50] [I] Averages: 10 inferences
[05/22/2021-02:31:50] [I] Percentile: 99
[05/22/2021-02:31:50] [I] Dump refittable layers:Disabled
[05/22/2021-02:31:50] [I] Dump output: Disabled
[05/22/2021-02:31:50] [I] Profile: Disabled
[05/22/2021-02:31:50] [I] Export timing to JSON file:
[05/22/2021-02:31:50] [I] Export output to JSON file:
[05/22/2021-02:31:50] [I] Export profile to JSON file:
[05/22/2021-02:31:50] [I]
[05/22/2021-02:31:50] [I] === Device Information ===
[05/22/2021-02:31:50] [I] Selected Device: NVIDIA GeForce RTX 2080 Ti
[05/22/2021-02:31:50] [I] Compute Capability: 7.5
[05/22/2021-02:31:50] [I] SMs: 68
[05/22/2021-02:31:50] [I] Compute Clock Rate: 1.755 GHz
[05/22/2021-02:31:50] [I] Device Global Memory: 11019 MiB
[05/22/2021-02:31:50] [I] Shared Memory per SM: 64 KiB
[05/22/2021-02:31:50] [I] Memory Bus Width: 352 bits (ECC disabled)
[05/22/2021-02:31:50] [I] Memory Clock Rate: 7 GHz
[05/22/2021-02:31:50] [I]
[05/22/2021-02:31:59] [I] [TRT] ----------------------------------------------------------------
[05/22/2021-02:31:59] [I] [TRT] Input filename: ocr-46k-v2020-9-21.onnx
[05/22/2021-02:31:59] [I] [TRT] ONNX IR version: 0.0.6
[05/22/2021-02:31:59] [I] [TRT] Opset version: 12
[05/22/2021-02:31:59] [I] [TRT] Producer name: pytorch
[05/22/2021-02:31:59] [I] [TRT] Producer version: 1.9
[05/22/2021-02:31:59] [I] [TRT] Domain:
[05/22/2021-02:31:59] [I] [TRT] Model version: 0
[05/22/2021-02:31:59] [I] [TRT] Doc string:
[05/22/2021-02:31:59] [I] [TRT] ----------------------------------------------------------------
[05/22/2021-02:31:59] [W] [TRT] /home/jenkins/agent/workspace/OSS/OSS_L0_MergeRequest/oss/parsers/onnx/onnx2trt_utils.cpp:227: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[05/22/2021-02:32:00] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[05/22/2021-02:32:08] [I] [TRT] Detected 1 inputs and 2 output network tensors.
[05/22/2021-02:32:08] [I] Engine built in 17.3336 sec.
[05/22/2021-02:32:08] [I] Starting inference
[05/22/2021-02:32:11] [I] Warmup completed 0 queries over 200 ms
[05/22/2021-02:32:11] [I] Timing trace has 0 queries over 3.0615 s
[05/22/2021-02:32:11] [I] Trace averages of 10 runs:
[05/22/2021-02:32:11] [I] Average on 10 runs - GPU latency: 22.2047 ms - Host latency: 22.7061 ms (end to end 44.3433 ms, enqueue 0.25303 ms)
[05/22/2021-02:32:11] [I] Average on 10 runs - GPU latency: 22.2777 ms - Host latency: 22.7786 ms (end to end 44.4851 ms, enqueue 0.256689 ms)
[05/22/2021-02:32:11] [I] Average on 10 runs - GPU latency: 22.3066 ms - Host latency: 22.8063 ms (end to end 44.504 ms, enqueue 0.236127 ms)
[05/22/2021-02:32:11] [I] Average on 10 runs - GPU latency: 22.396 ms - Host latency: 22.8964 ms (end to end 44.7539 ms, enqueue 0.243872 ms)
[05/22/2021-02:32:11] [I] Average on 10 runs - GPU latency: 22.3689 ms - Host latency: 22.8724 ms (end to end 44.658 ms, enqueue 0.292029 ms)
[05/22/2021-02:32:11] [I] Average on 10 runs - GPU latency: 22.2525 ms - Host latency: 22.7543 ms (end to end 44.4395 ms, enqueue 0.268433 ms)
[05/22/2021-02:32:11] [I] Average on 10 runs - GPU latency: 22.2277 ms - Host latency: 22.7285 ms (end to end 43.8266 ms, enqueue 0.231506 ms)
[05/22/2021-02:32:11] [I] Average on 10 runs - GPU latency: 22.4912 ms - Host latency: 22.9931 ms (end to end 44.8934 ms, enqueue 0.251599 ms)
[05/22/2021-02:32:11] [I] Average on 10 runs - GPU latency: 22.4305 ms - Host latency: 22.9408 ms (end to end 44.8159 ms, enqueue 0.292444 ms)
[05/22/2021-02:32:11] [I] Average on 10 runs - GPU latency: 22.3241 ms - Host latency: 22.8337 ms (end to end 44.5946 ms, enqueue 0.282642 ms)
[05/22/2021-02:32:11] [I] Average on 10 runs - GPU latency: 22.3398 ms - Host latency: 22.8398 ms (end to end 44.6165 ms, enqueue 0.256445 ms)
[05/22/2021-02:32:11] [I] Average on 10 runs - GPU latency: 22.4122 ms - Host latency: 22.9134 ms (end to end 44.7062 ms, enqueue 0.283838 ms)
[05/22/2021-02:32:11] [I] Average on 10 runs - GPU latency: 22.4602 ms - Host latency: 22.9635 ms (end to end 44.867 ms, enqueue 0.277002 ms)
[05/22/2021-02:32:11] [I] Host Latency
[05/22/2021-02:32:11] [I] min: 22.5284 ms (end to end 38.8506 ms)
[05/22/2021-02:32:11] [I] max: 23.4106 ms (end to end 45.5188 ms)
[05/22/2021-02:32:11] [I] mean: 22.8489 ms (end to end 44.5823 ms)
[05/22/2021-02:32:11] [I] median: 22.8138 ms (end to end 44.5718 ms)
[05/22/2021-02:32:11] [I] percentile: 23.1873 ms at 99% (end to end 45.2979 ms at 99%)
[05/22/2021-02:32:11] [I] throughput: 0 qps
[05/22/2021-02:32:11] [I] walltime: 3.0615 s
[05/22/2021-02:32:11] [I] Enqueue Time
[05/22/2021-02:32:11] [I] min: 0.202637 ms
[05/22/2021-02:32:11] [I] max: 0.371338 ms
[05/22/2021-02:32:11] [I] median: 0.254761 ms
[05/22/2021-02:32:11] [I] GPU Compute
[05/22/2021-02:32:11] [I] min: 22.0267 ms
[05/22/2021-02:32:11] [I] max: 22.9067 ms
[05/22/2021-02:32:11] [I] mean: 22.3462 ms
[05/22/2021-02:32:11] [I] median: 22.3125 ms
[05/22/2021-02:32:11] [I] percentile: 22.6871 ms at 99%
[05/22/2021-02:32:11] [I] total compute time: 3.03908 s
&&&& PASSED TensorRT.trtexec # trtexec --onnx=ocr-46k-v2020-9-21.onnx --shapes=input:32x3x32x256

Hi @zyddnys,

We are unable to reproduce the issue. We could successfully build the engine using trtexec with TensorRT v8.0.

[05/26/2021-16:44:33] [I] Total GPU Compute Time: 3.03108 s
[05/26/2021-16:44:33] [I] Explanations of the performance metrics are printed in the verbose logs.
[05/26/2021-16:44:33] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8000] # trtexec --onnx=ocr-46k-v2020-9-21.onnx --shapes=input:32x3x32x256

Please make sure TensorRT is installed correctly, and if you still face this issue, please share the complete error log from trtexec --verbose.
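
For reference, since the failing repro is a standalone C++ program rather than trtexec, the C++ counterpart of trtexec --verbose is to pass the builder a logger that does not filter out kVERBOSE messages. A minimal sketch (the class name is illustrative):

#include <NvInfer.h>
#include <iostream>

class VerboseLogger : public nvinfer1::ILogger
{
public:
    void log(Severity severity, const char* msg) noexcept override
    {
        // Do not filter anything: print every message, including kVERBOSE,
        // so the full builder trace (tactic timing, fallbacks) is captured.
        std::cerr << static_cast<int>(severity) << ": " << msg << std::endl;
    }
};

// Usage: construct one logger and pass it to the builder, e.g.
//   VerboseLogger logger;
//   auto* builder = nvinfer1::createInferBuilder(logger);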

Thank you.