I am trying to run an ONNX model using TensorRT, and I have been trying to use engine serialization to speed up loading times. However, I get different results from the model when running the engine built from the parsed ONNX model vs. running the serialized (and then deserialized) engine.

Here is a plot of confidences over time using the engine built directly from the ONNX file:

I am loading the onnx file as follows:

 auto builder = TRTUniquePtr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(basic_logger));
 const auto explicitBatch = 1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
 auto network = TRTUniquePtr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(explicitBatch));
 auto parser = TRTUniquePtr<nvonnxparser::IParser>(nvonnxparser::createParser(*network, basic_logger));
 auto config = TRTUniquePtr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());
 config->setMaxWorkspaceSize(32 * MEGABYTES_TO_BYTES);
 auto engine = std::shared_ptr<nvinfer1::ICudaEngine>(builder->buildEngineWithConfig(*network, *config), TRTDeleter());

Here is the plot of confidences when I use a deserialized engine. It's a similar shape, but noticeably lower in confidence, by 10-20% (confidence is an output of our model):

To check the serialized model, I am serializing and deserializing the engine as follows:

 auto serialized_engine = std::shared_ptr<nvinfer1::IHostMemory>(engine->serialize(), TRTDeleter());
 auto runtime = std::shared_ptr<nvinfer1::IRuntime>(nvinfer1::createInferRuntime(basic_logger), TRTDeleter());
 auto deserializedEngine = std::shared_ptr<nvinfer1::ICudaEngine>(runtime->deserializeCudaEngine(serialized_engine->data(), serialized_engine->size()), TRTDeleter());

To run the model, I use the following code (with either the original engine or the deserialized engine):

 auto execution_context = TRTUniquePtr<nvinfer1::IExecutionContext>(engine->createExecutionContext());
 bool executed = execution_context->enqueue(1, bindings, stream, nullptr);

Is there something I am missing when deserializing the model? Given the code above, I would expect the deserialized engine to produce the same results as the original engine, right?


OS Version: Ubuntu 18.04 x86_64
CUDA Driver: 460.80
ONNX: 1.6.0
TensorRT Version: 7.1.3


Here is the ONNX model (with the parameters stripped out):

So far I have only noticed this when using the C++ API. On my Jetson Xavier, I have verified that trtexec produces the same results before and after saving and reloading the serialized engine.

I will try making a small C++ unit test to verify the approach.
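For the comparison step of such a test, a minimal sketch might look like the following. It assumes the outputs of both engines have already been copied back to host memory as `std::vector<float>`. `maxAbsDiff` and `outputsMatch` are hypothetical helper names, not part of TensorRT, and the tolerance is an assumption that would need tuning for the model.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a TensorRT API): largest element-wise absolute
// difference between two host-side output buffers of equal length.
float maxAbsDiff(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        worst = std::max(worst, std::fabs(a[i] - b[i]));
    }
    return worst;
}

// Hypothetical helper: true when the buffers have the same length and every
// element agrees within `tolerance`.
bool outputsMatch(const std::vector<float>& a,
                  const std::vector<float>& b,
                  float tolerance) {
    return a.size() == b.size() && maxAbsDiff(a, b) <= tolerance;
}
```

In the test, the same input would be run through the original engine and the deserialized engine, and `outputsMatch` called on the two output buffers. A 10-20% drop in confidence should fail even with a fairly loose tolerance.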

