[polygraphy + transformer model] mark all tensorrt nodes as output crashes

Description

Polygraphy crashes when all TensorRT tensors are marked as outputs.
Removing --trt-outputs mark all from the command line makes the run succeed.
Neither Polygraphy's sanitize command nor onnx-simplifier helped.

Environment

TensorRT Version : 8.2 (preview)
NVIDIA GPU : 3090 RTX
NVIDIA Driver Version : 495.29.05
CUDA Version : 11.5
CUDNN Version : 8.3.0.98
Operating System : Linux Ubuntu 21.04
Python Version (if applicable) : 3.9
PyTorch Version (if applicable) : 1.10
Baremetal or Container (if so, version) : Baremetal

Relevant Files

https://drive.google.com/file/d/14wiCeBPTGtWRFdr8Z7-AVtlpCciHojxw/view?usp=sharing

Logs :

/mnt/workspace/fast_transformer$ polygraphy run triton_models/model-original.onnx --trt --onnxrt     --fp16     --seed 123     --val-range input_ids:[0,1000] attention_mask:[1,1] token_type_ids:[1,1]     --input-shapes input_ids:[1,16] attention_mask:[1,16] token_type_ids:[1,16]     --workspace=12G     --validate     --warm-up 200     --iterations 1     --atol 1e-1     --onnx-outputs mark all     --trt-outputs mark all --verbose
[V] Loaded Module: polygraphy.util    | Path: ['/home/geantvert/.local/share/virtualenvs/fast_transformer/lib/python3.9/site-packages/polygraphy/util']
[V] Model: triton_models/model-original.onnx
[V] Loaded Module: polygraphy         | Version: 0.33.0   | Path: ['/home/geantvert/.local/share/virtualenvs/fast_transformer/lib/python3.9/site-packages/polygraphy']
[V] Loaded Module: tensorrt           | Version: 8.2.0.6  | Path: ['/home/geantvert/.local/share/virtualenvs/fast_transformer/lib/python3.9/site-packages/tensorrt']
[I] Will generate inference input data according to provided TensorMetadata: {input_ids [shape=(1, 16)],
     attention_mask [shape=(1, 16)],
     token_type_ids [shape=(1, 16)]}
[I] trt-runner-N0-11/21/21-21:48:35     | Activating and starting inference
[11/21/2021-21:48:35] [TRT] [I] [MemUsageChange] Init CUDA: CPU +445, GPU +0, now: CPU 460, GPU 836 (MiB)
[11/21/2021-21:48:36] [TRT] [I] ----------------------------------------------------------------
[11/21/2021-21:48:36] [TRT] [I] Input filename:   /mnt/workspace/fast_transformer/triton_models/model-original.onnx
[11/21/2021-21:48:36] [TRT] [I] ONNX IR version:  0.0.7
[11/21/2021-21:48:36] [TRT] [I] Opset version:    12
[11/21/2021-21:48:36] [TRT] [I] Producer name:    pytorch
[11/21/2021-21:48:36] [TRT] [I] Producer version: 1.10
[11/21/2021-21:48:36] [TRT] [I] Domain:           
[11/21/2021-21:48:36] [TRT] [I] Model version:    0
[11/21/2021-21:48:36] [TRT] [I] Doc string:       
[11/21/2021-21:48:36] [TRT] [I] ----------------------------------------------------------------
[11/21/2021-21:48:36] [TRT] [W] onnx2trt_utils.cpp:366: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[11/21/2021-21:48:37] [TRT] [W] Output type must be INT32 for shape outputs
[11/21/2021-21:48:37] [TRT] [W] Output type must be INT32 for shape outputs
[11/21/2021-21:48:37] [TRT] [W] Output type must be INT32 for shape outputs
[11/21/2021-21:48:37] [TRT] [W] Output type must be INT32 for shape outputs
[V] Marking 677 tensors as outputs
[V]     Setting TensorRT Optimization Profiles
[V]     Input tensor: input_ids (dtype=DataType.INT32, shape=(-1, -1)) | Setting input tensor shapes to: (min=[1, 16], opt=[1, 16], max=[1, 16])
[V]     Input tensor: token_type_ids (dtype=DataType.INT32, shape=(-1, -1)) | Setting input tensor shapes to: (min=[1, 16], opt=[1, 16], max=[1, 16])
[V]     Input tensor: attention_mask (dtype=DataType.INT32, shape=(-1, -1)) | Setting input tensor shapes to: (min=[1, 16], opt=[1, 16], max=[1, 16])
[I]     Configuring with profiles: [Profile().add(input_ids, min=[1, 16], opt=[1, 16], max=[1, 16]).add(attention_mask, min=[1, 16], opt=[1, 16], max=[1, 16]).add(token_type_ids, min=[1, 16], opt=[1, 16], max=[1, 16])]
[I] Building engine with configuration:
    Workspace            | 12884901888 bytes (12288.00 MiB)
    Precision            | TF32: False, FP16: True, INT8: False, Strict Types: False
    Tactic Sources       | ['CUBLAS', 'CUBLAS_LT', 'CUDNN']
    Safety Restricted    | False
    Profiles             | 1 profile(s)
[11/21/2021-21:48:37] [TRT] [I] [MemUsageSnapshot] Builder begin: CPU 775 MiB, GPU 912 MiB
[11/21/2021-21:48:38] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +805, GPU +350, now: CPU 1581, GPU 1262 (MiB)
[11/21/2021-21:48:38] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +125, GPU +58, now: CPU 1706, GPU 1320 (MiB)
[11/21/2021-21:48:38] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[11/21/2021-21:48:38] [TRT] [E] 2: [optimizer.cpp::getFormatRequirements::3815] Error Code 2: Internal Error (Assertion !n->candidateRequirements.empty() failed. no supported formats)
[11/21/2021-21:48:38] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::561] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed. )
[!] Invalid Engine. Please ensure the engine was built correctly

Steps To Reproduce

polygraphy run triton_models/model-original.onnx --trt --onnxrt \
    --fp16   --seed 123 \
    --val-range input_ids:[0,1000] attention_mask:[1,1] token_type_ids:[1,1] \
    --input-shapes input_ids:[1,16] attention_mask:[1,16] token_type_ids:[1,16] \
    --workspace=12G \
    --validate \
    --warm-up 200 \
    --iterations 1 \
    --atol 1e-1 \
    --onnx-outputs mark all \
    --trt-outputs mark all --verbose

Hi, please refer to the links below to perform inference in INT8.

Thanks!

Hi,

My issue is not related to INT8 quantization (there is no quantization involved). I just want to see the output of each node in my model and compare it with ONNX Runtime.
The model works well in FP16; it is only marking all TensorRT tensors as outputs that makes Polygraphy crash.

And to be clear, the minimal crash case is (no FP16 or anything else):

polygraphy run triton_models/model-original.onnx \
    --trt \
    --val-range input_ids:[0,1000] attention_mask:[1,1] token_type_ids:[1,1] \
    --input-shapes input_ids:[1,16] attention_mask:[1,16] token_type_ids:[1,16] \
    --workspace=12G \
    --trt-outputs mark all \
    --verbose
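
One way to narrow this down might be to mark only a subset of tensors and bisect until a single offending tensor is left. The sketch below is only an illustration under several assumptions: the helper names (polygraphy_cmd, build_fails, find_failing_tensor) are hypothetical, it assumes polygraphy exits with a non-zero status when the engine build fails, and it assumes a single tensor triggers the failure whenever it is in the marked set, which may not hold if the crash depends on a combination of outputs. The tensor names themselves would have to come from the ONNX model.

```python
import subprocess
from typing import Callable, List, Optional, Sequence


def polygraphy_cmd(model_path: str, outputs: Sequence[str]) -> List[str]:
    """Build a `polygraphy run` command that marks only `outputs` as TensorRT outputs."""
    return [
        "polygraphy", "run", model_path, "--trt",
        "--input-shapes",
        "input_ids:[1,16]", "attention_mask:[1,16]", "token_type_ids:[1,16]",
        "--trt-outputs", *outputs,
    ]


def build_fails(model_path: str, outputs: Sequence[str]) -> bool:
    """Return True if the run fails (assumption: failure means a non-zero exit code)."""
    result = subprocess.run(polygraphy_cmd(model_path, outputs), capture_output=True)
    return result.returncode != 0


def find_failing_tensor(
    names: Sequence[str],
    fails: Callable[[Sequence[str]], bool],
) -> Optional[str]:
    """Binary-search for one tensor whose presence in the marked set makes `fails` True.

    Assumes exactly one such tensor exists; returns None if the full set does not fail.
    """
    candidates = list(names)
    if not candidates or not fails(candidates):
        return None
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # Keep whichever half still reproduces the failure.
        candidates = half if fails(half) else candidates[len(half):]
    return candidates[0]
```

With a list of tensor names in hand, this could be called as find_failing_tensor(names, lambda subset: build_fails("triton_models/model-original.onnx", subset)). With 677 tensors that is roughly ten engine builds rather than one per tensor.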

Hi,

We recommend posting your concern on the NVIDIA/TensorRT GitHub Issues page to get better help.

Thank you.