TensorRT: issue running an ONNX model with INT8

I tried my ONNX model in TensorRT, following the link below:
https://elinux.org/TensorRT/YoloV3

Command:

trtexec --onnx=my_model.onnx --output=idx:195_convolutional --output=idx:205_convolutional --output=idx:215_convolutional --int8 --batch=1 --device=0

But it fails with the error below.
Device: Jetson Xavier
SW: TensorRT 5.1.6, CUDA 10.0, cuDNN 7.5.1

Input filename: elan_qu2.onnx
ONNX IR version: 0.0.5
Opset version: 9
Producer name: ELAN-AIRD
Producer version:
Domain:
Model version: 0
Doc string:

WARNING: ONNX model has a newer ir_version (0.0.5) than this parser was built against (0.0.3).
[W] [TRT] Tensor idx:191_sub is uniformly zero; network calibration failed.
[W] [TRT] Tensor idx:191_sub copy is uniformly zero; network calibration failed.
[E] [TRT] …/builder/cudnnBuilder2.cpp (1791) - Misc Error in createRegionScalesFromTensorScales: -1 (Could not find scales for tensor idx:117_convolutional_batch_normalize_activation copy.)
[E] [TRT] …/builder/cudnnBuilder2.cpp (1791) - Misc Error in createRegionScalesFromTensorScales: -1 (Could not find scales for tensor idx:117_convolutional_batch_normalize_activation copy.)
[E] could not build engine
[E] Engine could not be created
[E] Engine could not be created
&&&& FAILED TensorRT.trtexec # trtexec --onnx=my_model.onnx --input=input_data --output=idx:195_convolutional --output=idx:205_convolutional --output=idx:215_convolutional --int8 --batch=1 --device=0

It runs successfully with --fp16; please help with --int8.
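
For reference, the "network calibration failed" and "could not find scales" messages suggest the INT8 builder never obtained a valid scale for those tensors. One common way to supply scales is to run calibration with real input data and then pass the resulting cache to trtexec via --calib=<file>. A minimal calibrator sketch, assuming the TensorRT Python API and pycuda are installed; the data source, batch size, and file names are illustrative, not taken from this thread:

import os
import numpy as np
import pycuda.autoinit          # creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    # Feeds preprocessed NCHW float32 batches to the INT8 builder and caches the scales it computes.
    def __init__(self, batches, cache_file="calib.cache"):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.batches = list(batches)        # e.g. a list of arrays shaped (1, C, H, W)
        self.cache_file = cache_file
        self.index = 0
        self.device_input = cuda.mem_alloc(self.batches[0].nbytes)

    def get_batch_size(self):
        return 1

    def get_batch(self, names):
        if self.index >= len(self.batches):
            return None                     # no more data: calibration is finished
        batch = np.ascontiguousarray(self.batches[self.index], dtype=np.float32)
        cuda.memcpy_htod(self.device_input, batch)
        self.index += 1
        return [int(self.device_input)]

    def read_calibration_cache(self):
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                return f.read()

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)

The calibrator is attached when building an engine from Python (builder.int8_mode = True; builder.int8_calibrator = EntropyCalibrator(batches)); the calib.cache it writes can afterwards be reused with trtexec --int8 --calib=calib.cache. Whether that resolves the missing scale on the slice output here is not confirmed.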

Hi,

Can you please share the model file to reproduce the issue?

Thanks

How do I share the file?

The file size is over 10 MB; please download from the links below:

https://drive.google.com/open?id=1pAv5mGIhbvFdhM3gOgUawX5NJJDD5_FJ
https://drive.google.com/open?id=1j7MqVvtbt_Cyrxe9Zggljprt-ylNiaXz

There are two models: "my_model173.onnx" succeeds, but "my_model174.onnx" does not. I think the problem lies in the slice layer "idx:174_slice", which slices from "idx:117_convolutional_batch_normalize_activation", but I am not sure why its scale cannot be found.
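
If it helps, that suspicion can be double-checked by inspecting the ONNX graph directly. A small sketch, assuming the onnx Python package and the file name my_model174.onnx from the links above:

import onnx

model = onnx.load("my_model174.onnx")
onnx.checker.check_model(model)

# List every Slice node with its input/output tensor names, to confirm that
# idx:174_slice really reads from idx:117_convolutional_batch_normalize_activation.
for node in model.graph.node:
    if node.op_type == "Slice":
        print(node.name or "<unnamed>", "inputs:", list(node.input), "outputs:", list(node.output))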

Hi,

The model seems to be working on TRT 6.
Could you please try to upgrade to JetPack 4.3 and try on TRT 6?
https://docs.nvidia.com/jetson/jetpack/release-notes/#jetpack-version

Thanks

Hi SunilJB,

I upgraded to JetPack 4.3 and tried TRT 6 with "my_model174.onnx", but it reports an error about an unknown option.

Command:

nvidia@nvidia:~/Downloads$ trtexec --onnx=my_model.onnx --output=idx:174_activation --int8 --batch=1 --device=0

Information:

&&&& RUNNING TensorRT.trtexec # trtexec --onnx=my_model.onnx --output=idx:174_activation --int8 --batch=1 --device=0
[11/20/2019-15:57:41] [E] Unknown option: --output idx:174_activation
=== Model Options ===
  --uff=<file>                UFF model
  --onnx=<file>               ONNX model
  --model=<file>              Caffe model (default = no model, random weights used)
  --deploy=<file>             Caffe prototxt file
  --output=<name>[,<name>]*   Output names (it can be specified multiple times); at least one output is required for UFF and Caffe
  --uffInput=<name>,X,Y,Z     Input blob name and its dimensions (X,Y,Z=C,H,W), it can be specified multiple times; at least one is required for UFF models
  --uffNHWC                   Set if inputs are in the NHWC layout instead of NCHW (use X,Y,Z=H,W,C order in --uffInput)

=== Build Options ===
  --maxBatch                  Set max batch size and build an implicit batch engine (default = 1)
  --explicitBatch             Use explicit batch sizes when building the engine (default = implicit)
  --minShapes=spec            Build with dynamic shapes using a profile with the min shapes provided
  --optShapes=spec            Build with dynamic shapes using a profile with the opt shapes provided
  --maxShapes=spec            Build with dynamic shapes using a profile with the max shapes provided
                              Note: if any of min/max/opt is missing, the profile will be completed using the shapes 
                                    provided and assuming that opt will be equal to max unless they are both specified;
                                    partially specified shapes are applied starting from the batch size;
                                    dynamic shapes imply explicit batch
                              Input shapes spec ::= Ishp[","spec]
                                           Ishp ::= name":"shape
                                          shape ::= N[["x"N]*"*"]
  --inputIOFormats=spec       Type and formats of the input tensors (default = all inputs in fp32:chw)
  --outputIOFormats=spec      Type and formats of the output tensors (default = all outputs in fp32:chw)
                              IO Formats: spec  ::= IOfmt[","spec]
                                          IOfmt ::= type:fmt
                                          type  ::= "fp32"|"fp16"|"int32"|"int8"
                                          fmt   ::= ("chw"|"chw2"|"chw4"|"hwc8"|"chw16"|"chw32")["+"fmt]
  --workspace=N               Set workspace size in megabytes (default = 16)
  --minTiming=M               Set the minimum number of iterations used in kernel selection (default = 1)
  --avgTiming=M               Set the number of times averaged in each iteration for kernel selection (default = 8)
  --fp16                      Enable fp16 mode (default = disabled)
  --int8                      Run in int8 mode (default = disabled)
  --calib=<file>              Read INT8 calibration cache file
  --safe                      Only test the functionality available in safety restricted flows
  --saveEngine=<file>         Save the serialized engine
  --loadEngine=<file>         Load a serialized engine

=== Inference Options ===
  --batch=N                   Set batch size for implicit batch engines (default = 1)
  --shapes=spec               Set input shapes for explicit batch and dynamic shapes inputs
                              Input shapes spec ::= Ishp[","spec]
                                           Ishp ::= name":"shape
                                          shape ::= N[["x"N]*"*"]
  --iterations=N              Run at least N inference iterations (default = 10)
  --warmUp=N                  Run for N milliseconds to warmup before measuring performance (default = 200)
  --duration=N                Run performance measurements for at least N seconds wallclock time (default = 10)
  --sleepTime=N               Delay inference start with a gap of N milliseconds between launch and compute (default = 0)
  --streams=N                 Instantiate N engines to use concurrently (default = 1)
  --useSpinWait               Actively synchronize on GPU events. This option may decrease synchronization time but increase CPU usage and power (default = false)
  --threads                   Enable multithreading to drive engines with independent threads (default = disabled)
  --useCudaGraph              Use cuda graph to capture engine execution and then launch inference (default = false)
  --buildOnly                 Skip inference perf measurement (default = disabled)

=== Build and Inference Batch Options ===
                              When using implicit batch, the max batch size of the engine, if not given, 
                              is set to the inference batch size;
                              when using explicit batch, if shapes are specified only for inference, they 
                              will be used also as min/opt/max in the build profile; if shapes are 
                              specified only for the build, the opt shapes will be used also for inference;
                              if both are specified, they must be compatible; and if explicit batch is 
                              enabled but neither is specified, the model must provide complete static
                              dimensions, including batch size, for all inputs

=== Reporting Options ===
  --verbose                   Use verbose logging (default = false)
  --avgRuns=N                 Report performance measurements averaged over N consecutive iterations (default = 10)
  --percentile=P              Report performance for the P percentage (0<=P<=100, 0 representing max perf, and 100 representing min perf; (default = 99%)
  --dumpOutput                Print the output tensor(s) of the last inference iteration (default = disabled)
  --dumpProfile               Print profile information per layer (default = disabled)
  --exportTimes=<file>        Write the timing results in a json file (default = disabled)
  --exportProfile=<file>      Write the profile information per layer in a json file (default = disabled)

=== System Options ===
  --device=N                  Select cuda device N (default = 0)
  --useDLACore=N              Select DLA core N for layers that support DLA (default = none)
  --allowGPUFallback          When DLA is enabled, allow GPU fallback for unsupported layers (default = disabled)
  --plugins                   Plugin library (.so) to load (can be specified multiple times)

=== Help ===
  --help                      Print this message
Note: the following options are not fully supported in trtexec: dynamic shapes, multistream/threads, cuda graphs, json logs, and actual data IO
&&&& FAILED TensorRT.trtexec # trtexec --onnx=my_model.onnx --output=idx:174_activation --int8 --batch=1 --device=0

If --output is not given, it prompts a "Network must have at least one output" error.
Did the options change in TRT 6? How do I use --output?

Hi,

The "--output" parameter is mandatory only for UFF and Caffe models.
Check trtexec --help:
Mandatory params for UFF:
--uffInput=<name>,C,H,W     Input blob name and its dimensions for UFF parser (can be specified multiple times)
--output=<name>             Output blob name (can be specified multiple times)

Mandatory params for Caffe:
--output=<name>             Output blob name (can be specified multiple times)

Can you try without the "--output" option, in "--verbose" mode?

trtexec --onnx=my_model_174.onnx --int8 --batch=1 --device=0 --verbose
trtexec --onnx=my_model_174.onnx --int8 --batch=1 --device=0 --saveEngine=<file> --verbose

Thanks

Sorry I'm late.

Thank you for the reply. I can now run the ONNX model with trtexec on TRT 6.0.

I will verify the quantized inference performance (a rough measurement sketch follows below).

Thank you.
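
On verifying INT8 performance: trtexec itself reports latency, and --avgRuns, --dumpProfile, and --exportProfile from the help output above give more detail. For a standalone check, a rough sketch with the TensorRT Python API and pycuda, assuming an implicit-batch engine was saved as my_model_174_int8.engine via --saveEngine (the file name is illustrative):

import time
import numpy as np
import pycuda.autoinit          # creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("my_model_174_int8.engine", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Allocate one host/device buffer per binding, sized from the engine.
host_bufs, dev_bufs, bindings = [], [], []
for i in range(engine.num_bindings):
    size = trt.volume(engine.get_binding_shape(i)) * engine.max_batch_size
    dtype = trt.nptype(engine.get_binding_dtype(i))
    host = cuda.pagelocked_empty(size, dtype)
    dev = cuda.mem_alloc(host.nbytes)
    host_bufs.append(host)
    dev_bufs.append(dev)
    bindings.append(int(dev))

# Random input data is enough for a latency check.
for i in range(engine.num_bindings):
    if engine.binding_is_input(i):
        host_bufs[i][:] = np.random.random(size=host_bufs[i].shape).astype(host_bufs[i].dtype)
        cuda.memcpy_htod(dev_bufs[i], host_bufs[i])

runs = 100
start = time.time()
for _ in range(runs):
    context.execute(1, bindings)            # implicit-batch execution, batch size 1
print("average latency: %.3f ms" % ((time.time() - start) / runs * 1000.0))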

Hi,
Another problem: we tried a single ONNX convolution layer with ([in_ch, out_ch, w, h] = 128 * 128 * 3 * 3), and running ./trtexec succeeds on DLA; but with a single convolution layer of ([in_ch, out_ch, w, h] = 1024 * 1024 * 3 * 3), ./trtexec does not succeed on DLA and has to use the GPU.

So, does DLA not support a convolution kernel channel count of 1024?

What is the limit on the convolution kernel channel count?
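
One way to narrow this down is to export a single convolution to ONNX and run it through trtexec with --useDLACore=0 --allowGPUFallback --verbose; the verbose log then shows whether the layer stays on DLA or falls back to the GPU. A sketch assuming PyTorch is available (file and tensor names are illustrative):

import torch
import torch.nn as nn

# A single 3x3 convolution with 1024 input and 1024 output channels,
# i.e. the case reported to fall back from DLA to the GPU.
conv = nn.Conv2d(in_channels=1024, out_channels=1024, kernel_size=3, padding=1, bias=False)
dummy = torch.randn(1, 1024, 16, 16)    # the spatial size is arbitrary for this test

torch.onnx.export(conv, dummy, "conv_1024x1024x3x3.onnx", opset_version=9,
                  input_names=["input_data"], output_names=["conv_out"])

Then: trtexec --onnx=conv_1024x1024x3x3.onnx --useDLACore=0 --allowGPUFallback --verbose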