Description
I am currently trying to run an ONNX model using TensorRT, and I have been leveraging engine serialization to speed up loading times. However, I have noticed that I get different results when running the engine built directly from the parsed ONNX model vs. running the serialized-then-deserialized engine.
Here is a plot of confidences over time using the engine built directly from the ONNX model.
I am loading the ONNX file as follows:
auto builder = TRTUniquePtr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(basic_logger));
const auto explicitBatch = 1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
auto network = TRTUniquePtr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(explicitBatch));
auto parser = TRTUniquePtr<nvonnxparser::IParser>(nvonnxparser::createParser(*network, basic_logger));
// parse the ONNX file into the network (onnx_path holds the model path)
parser->parseFromFile(onnx_path.c_str(), static_cast<int>(nvinfer1::ILogger::Severity::kWARNING));
auto config = TRTUniquePtr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());
config->setMaxWorkspaceSize(32 * MEGABYTES_TO_BYTES);
auto engine = std::shared_ptr<nvinfer1::ICudaEngine>(builder->buildEngineWithConfig(*network, *config), TRTDeleter());
Here is the plot of confidences when I use the deserialized engine. It's a similar shape, but noticeably 10-20% lower in confidence (confidence is an output of our model).
To check the serialized model, I am serializing and deserializing the engine as follows:
auto serialized_engine = std::shared_ptr<nvinfer1::IHostMemory>(engine->serialize(), TRTDeleter());
auto runtime = std::shared_ptr<nvinfer1::IRuntime>(nvinfer1::createInferRuntime(basic_logger), TRTDeleter());
auto deserializedEngine = std::shared_ptr<nvinfer1::ICudaEngine>(runtime->deserializeCudaEngine(serialized_engine->data(), serialized_engine->size()), TRTDeleter());
To run the model I use the following code (substituting either the original engine or the deserialized engine):
auto execution_context = TRTUniquePtr<nvinfer1::IExecutionContext>(engine->createExecutionContext());
bool executed = execution_context->enqueue(1, bindings, stream, nullptr);
Is there something I am missing when deserializing the model? Given the code listed above, I would expect the deserialized engine to produce the same results as the original engine, right?
Environment
OS Version: Ubuntu 18.04 x86_64
CUDA Driver Version: 460.80
ONNX Version: 1.6.0
TensorRT Version: 7.1.3
Model
Here is the ONNX model (with the parameters stripped out):
model_stripped.onnx (34.3 KB)

