TensorRT 8 INT8 (signed char) I/O interface for an ONNX model

Description

Dear NVIDIA support team,

I am trying to run an ONNX-parsed, quantized, serialized TensorRT 8 model with an INT8 I/O interface.

In my environment, memory bandwidth is a critical constraint, so I need an INT8 (signed char) interface.

I have my own quantization and serialization code for the ONNX model, which follows NVIDIA’s samples and runs successfully (with an FP32 interface).

I have been looking at the “Adding A Custom Layer That Supports INT8 I/O To Your Network In TensorRT” example in TensorRT/samples, but that code still uses an FP32 host interface and only converts the data to INT8 afterwards (just using static_cast).

So my question is: is there any way to use an INT8 (signed char, 1-byte) interface between device and host, using the C++ API?
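To make the question concrete, this is the kind of host-to-device transfer I am hoping for (just an illustration; the tensor size and variable names are made up), i.e. copying 1-byte signed char data instead of 4-byte float data:

#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// Illustration only: the same tensor copied to the device as FP32 vs. INT8.
void copyExample()
{
    const size_t count = 1 * 3 * 224 * 224;   // example tensor volume

    // Current FP32 interface: 4 bytes per element over PCIe.
    std::vector<float> hostFp32(count);
    void* devFp32 = nullptr;
    cudaMalloc(&devFp32, count * sizeof(float));
    cudaMemcpy(devFp32, hostFp32.data(), count * sizeof(float), cudaMemcpyHostToDevice);

    // Desired INT8 interface: 1 byte per element, a quarter of the transfer size.
    std::vector<int8_t> hostInt8(count);
    void* devInt8 = nullptr;
    cudaMalloc(&devInt8, count * sizeof(int8_t));
    cudaMemcpy(devInt8, hostInt8.data(), count * sizeof(int8_t), cudaMemcpyHostToDevice);

    cudaFree(devFp32);
    cudaFree(devInt8);
}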

Environment

TensorRT Version: 8.2
GPU Type: Compute Capability 7.5
Nvidia Driver Version: 470 (maybe)
CUDA Version: 11.4
CUDNN Version: 8.2
Operating System + Version: Ubuntu 18.04
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

Please attach or include links to any models, data, files, or scripts necessary to reproduce your issue. (Github repo, Google Drive, Dropbox, etc.)

Steps To Reproduce

Please include:

  • Exact steps/commands to build your repro
  • Exact steps/commands to run your repro
  • Full traceback of errors encountered

I have tried network->addInput, but it produces the error below:

 [network.cpp::addInput::1507] Error Code 3: API Usage Error (Parameter check failed at: optimizer/api/network.cpp::addInput::1507, condition: inName != knownInput->getName())

The input layer name is correct, because I used network->getInput(0)->getName() as the name argument of network->addInput.

So I tried a different name for addInput, such as input_test, and the error above disappeared.

But another problem occurred.

Whenever I try to serialize the engine, it reports the added input_test as Unused Input: input_test and prints the errors below (the code includes INT8 quantization calibration):

[W] [TRT] [RemoveDeadLayers] Input Tensor input  is unused or used only at compile-time, but is not being removed.
...
 [calibrator.cpp::calibrateEngine::1132] Error Code 2: Internal Error (Assertion lastInput + 1 == nbInputs failed. )
[builder.cpp::buildSerializedNetwork::609] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed. )

The examples I found online (via Google) use two networks, a “preprocess network” and a “prediction network”; my understanding of that approach is sketched below.
Is that the only way to get INT8 I/O?
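As far as I understand, the two-network idea looks roughly like this (my own sketch of it, not code I have verified; variable names follow the TensorRT sample code and the dimensions are made up):

// A tiny "preprocess" network that takes signed-char data and casts it to FP32,
// built as its own engine and run in front of the FP32-input prediction engine.
auto preprocess = SampleUniquePtr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(explicitBatch));
nvinfer1::ITensor* int8In = preprocess->addInput("input_int8", nvinfer1::DataType::kINT8, nvinfer1::Dims4{1, 3, 224, 224});
int8In->setDynamicRange(-128.0f, 127.0f);      // INT8 tensors need an explicit dynamic range
auto* cast = preprocess->addIdentity(*int8In); // identity layer used as an INT8 -> FP32 cast
cast->setOutputType(0, nvinfer1::DataType::kFLOAT);
preprocess->markOutput(*cast->getOutput(0));
// Build this with BuilderFlag::kINT8 set on its config, then feed its output
// buffer (device to device) into the existing prediction engine.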

Or is there any other way?

If there is any way to connect input_test to the actual input, please let me know.

please help me :)

Hi,

Could you please share with us a minimal issue repro (model, scripts) and complete error logs for better debugging?

Thank you.

The input layer of my ONNX model is named “input”, and the output layers are named “output0”, “output1”, and “output2”.

I have used the sample code from the TensorRT C++ package.
Here is the code from the network-construction phase:

// builder is created beforehand with nvinfer1::createInferBuilder() (not shown)
const auto explicitBatch = 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
network = SampleUniquePtr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(explicitBatch));
config = SampleUniquePtr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());
parser = SampleUniquePtr<nvonnxparser::IParser>(nvonnxparser::createParser(*network, sample::gLogger.getTRTLogger()));

// parse the ONNX model into the network definition
auto parsed = parser->parseFromFile("path_to_onnx_file.onnx", static_cast<int>(sample::gLogger.getReportableSeverity()));

// builder/config settings and the calibrator used for INT8
builder->setMaxBatchSize(mParams.batchSize);
config->setMaxWorkspaceSize(10_GiB);
std::unique_ptr<IInt8Calibrator> calibrator;

//=====================
// DO SOMETHING TO THE NETWORK
//=====================

// DO THE CALIBRATION TO INT8
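For reference, the calibration part (the “DO THE CALIBRATION TO INT8” placeholder above) follows the standard pattern from the samples, roughly like this, where MyEntropyCalibrator2 stands in for my own IInt8EntropyCalibrator2 implementation:

// Standard INT8 calibration setup (sketch; MyEntropyCalibrator2 is a placeholder
// for my own IInt8EntropyCalibrator2 implementation).
calibrator.reset(new MyEntropyCalibrator2(/* calibration batch stream, cache file, ... */));
config->setFlag(nvinfer1::BuilderFlag::kINT8);
config->setInt8Calibrator(calibrator.get());

// Build and serialize the engine.
SampleUniquePtr<nvinfer1::IHostMemory> plan{builder->buildSerializedNetwork(*network, *config)};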

The error messages differ slightly depending on the input name I set.
In the “DO SOMETHING TO THE NETWORK” part:

    // try to add an INT8 input with the same name as the parsed network input
    nvinfer1::ITensor* input_layer = network->getInput(0);
    const char* input_name = "input"; // same string as input_layer->getName()
    nvinfer1::ITensor* input_int8_layer = network->addInput(input_name, DataType::kINT8, input_layer->getDimensions());
    ASSERT(input_int8_layer != nullptr);

The error message is below:

[02/15/2022-09:21:44] [I] Building and running a GPU inference engine for MY_MODEL
The model is parsed from path_to_onnx_file.onnx file
[02/15/2022-09:21:44] [I] [TRT] [MemUsageChange] Init CUDA: CPU +324, GPU +0, now: CPU 335, GPU 501 (MiB)
[02/15/2022-09:21:45] [I] [TRT] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 335 MiB, GPU 501 MiB
[02/15/2022-09:21:45] [I] [TRT] [MemUsageSnapshot] End constructing builder kernel library: CPU 469 MiB, GPU 535 MiB
[02/15/2022-09:21:45] [I] [TRT] ----------------------------------------------------------------
[02/15/2022-09:21:45] [I] [TRT] Input filename: path_to_onnx_file.onnx
[02/15/2022-09:21:45] [I] [TRT] ONNX IR version:  0.0.7
[02/15/2022-09:21:45] [I] [TRT] Opset version:    9
[02/15/2022-09:21:45] [I] [TRT] Producer name:    pytorch
[02/15/2022-09:21:45] [I] [TRT] Producer version: 1.10
[02/15/2022-09:21:45] [I] [TRT] Domain:           
[02/15/2022-09:21:45] [I] [TRT] Model version:    0
[02/15/2022-09:21:45] [I] [TRT] Doc string:       
[02/15/2022-09:21:45] [I] [TRT] ----------------------------------------------------------------
[02/15/2022-09:21:45] [E] [TRT] [network.cpp::addInput::1507] Error Code 3: API Usage Error (Parameter check failed at: optimizer/api/network.cpp::addInput::1507, condition: inName != knownInput->getName())
[02/15/2022-09:21:45] [E] Assertion failure: input_int8_layer != nullptr

But if I pass a different name to network->addInput, as below, the error changes
(the rest of the code is exactly the same):

const char* input_name = "input_temp"; // deliberately different from input_layer->getName()

The error log is below:

[02/15/2022-09:24:58] [I] Building and running a GPU inference engine for MY_MODEL
The model is parsed from path_to_onnx_file.onnx file
[02/15/2022-09:24:58] [I] [TRT] [MemUsageChange] Init CUDA: CPU +324, GPU +0, now: CPU 335, GPU 501 (MiB)
[02/15/2022-09:24:59] [I] [TRT] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 335 MiB, GPU 501 MiB
[02/15/2022-09:24:59] [I] [TRT] [MemUsageSnapshot] End constructing builder kernel library: CPU 469 MiB, GPU 535 MiB
[02/15/2022-09:24:59] [I] [TRT] ----------------------------------------------------------------
[02/15/2022-09:24:59] [I] [TRT] Input filename: path_to_onnx_file.onnx
[02/15/2022-09:24:59] [I] [TRT] ONNX IR version:  0.0.7
[02/15/2022-09:24:59] [I] [TRT] Opset version:    9
[02/15/2022-09:24:59] [I] [TRT] Producer name:    pytorch
[02/15/2022-09:24:59] [I] [TRT] Producer version: 1.10
[02/15/2022-09:24:59] [I] [TRT] Domain:           
[02/15/2022-09:24:59] [I] [TRT] Model version:    0
[02/15/2022-09:24:59] [I] [TRT] Doc string:       
[02/15/2022-09:24:59] [I] [TRT] ----------------------------------------------------------------
[02/15/2022-09:24:59] [I] Using Entropy Calibrator 2
[02/15/2022-09:24:59] [W] [TRT] Unused Input: input_temp
[02/15/2022-09:24:59] [W] [TRT] [RemoveDeadLayers] Input Tensor input_temp is unused or used only at compile-time, but is not being removed.
[02/15/2022-09:24:59] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.6.3 but loaded cuBLAS/cuBLAS LT 11.5.2
[02/15/2022-09:24:59] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +484, GPU +206, now: CPU 977, GPU 749 (MiB)
[02/15/2022-09:24:59] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +401, GPU +204, now: CPU 1378, GPU 953 (MiB)
[02/15/2022-09:24:59] [I] [TRT] Timing cache disabled. Turning it on will improve builder speed.
[02/15/2022-09:25:01] [I] [TRT] Detected 2 inputs and 3 output network tensors.
[02/15/2022-09:25:01] [I] [TRT] Total Host Persistent Memory: 111728
[02/15/2022-09:25:01] [I] [TRT] Total Device Persistent Memory: 0
[02/15/2022-09:25:01] [I] [TRT] Total Scratch Memory: 0
[02/15/2022-09:25:01] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 1 MiB, GPU 136 MiB
[02/15/2022-09:25:01] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 16.4134ms to assign 7 blocks to 141 nodes requiring 139345920 bytes.
[02/15/2022-09:25:01] [I] [TRT] Total Activation Memory: 139345920
[02/15/2022-09:25:01] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.6.3 but loaded cuBLAS/cuBLAS LT 11.5.2
[02/15/2022-09:25:01] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1816, GPU 1133 (MiB)
[02/15/2022-09:25:01] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1816, GPU 1141 (MiB)
[02/15/2022-09:25:01] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.6.3 but loaded cuBLAS/cuBLAS LT 11.5.2
[02/15/2022-09:25:01] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 1816, GPU 1117 (MiB)
[02/15/2022-09:25:01] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1816, GPU 1125 (MiB)
[02/15/2022-09:25:01] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +132, now: CPU 0, GPU 141 (MiB)
[02/15/2022-09:25:01] [E] [TRT] 2: [calibrator.cpp::calibrateEngine::1132] Error Code 2: Internal Error (Assertion lastInput + 1 == nbInputs failed. )
[02/15/2022-09:25:01] [E] [TRT] 2: [builder.cpp::buildSerializedNetwork::609] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed. )

What I want to do is add an INT8 input layer to my INT8-quantized, serialized TensorRT model, because it reduces the input bandwidth to a quarter.
(As you may know, the default TensorRT INT8 quantization and serialization flow still takes FP32 input data.)
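For example, after deserializing my engine I can see that the input binding is still FP32 (a quick check; the engine variable comes from my own code):

// Quick check on the deserialized engine: the input binding type is still
// DataType::kFLOAT, so the host-side buffer has to be 4 bytes per element
// even though the network itself runs in INT8.
const int inputIndex = engine->getBindingIndex("input");
const nvinfer1::DataType inputType = engine->getBindingDataType(inputIndex);
// inputType == nvinfer1::DataType::kFLOAT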

What could be the problem?

Hi, please refer to the links below to perform inference in INT8.

Thanks!