Description
TensorRT issues the following error when deserializing an engine on a Tesla P100 machine:
/_src/rtSafe/resources.h (441) - Cuda Error in loadKernel: -1 (TensorRT internal error)
INVALID_STATE: std::exception
INVALID_CONFIG: Deserialize the cuda engine failed.
Here is the stack trace related to the error from gdb:
#0 0x00007fffb3ccabe0 in __cxa_throw () from /lib64/libstdc++.so.6
#1 0x00007fffe39788cf in nvinfer1::throwCudaError(char const*, char const*, int, int, char const*) () from /usr/local/lib64/third_party/libnvinfer.so.7
#2 0x00007fffe37426cb in nvinfer1::rt::ArchiveReadUtils::load(nvinfer1::rt::ReadArchive&, nvinfer1::DriverKernel&, unsigned short) () from /usr/local/lib64/third_party/libnvinfer.so.7
#3 0x00007fffe374d474 in nvinfer1::rt::ArchiveReadUtils::load(nvinfer1::rt::ReadArchive&, nvinfer1::OptionalValue<nvinfer1::rt::cuda::PointWiseV2Runner>&, unsigned short) () from /usr/local/lib64/third_party/libnvinfer.so.7
#4 0x00007fffe37596c4 in ?? () from /usr/local/lib64/third_party/libnvinfer.so.7
#5 0x00007fffe374fa44 in nvinfer1::rt::ArchiveReadUtils::load(nvinfer1::rt::ReadArchive&, nvinfer1::OptionalValue<nvinfer1::rt::Runner>&, unsigned short) () from /usr/local/lib64/third_party/libnvinfer.so.7
#6 0x00007fffe39546fc in nvinfer1::rt::SafeEngine::deserializeCoreEngine(nvinfer1::rt::CoreReadArchive&, std::vector<nvinfer1::rt::EngineLayerAttribute, std::allocator<nvinfer1::rt::EngineLayerAttribute> >&) () from /usr/local/lib64/third_party/libnvinfer.so.7
#7 0x00007fffe36faf32 in nvinfer1::rt::Engine::deserialize(void const*, unsigned long, nvinfer1::IGpuAllocator&, nvinfer1::IPluginFactory*) () from /usr/local/lib64/third_party/libnvinfer.so.7
#8 0x00007fffe3704465 in nvinfer1::Runtime::deserializeCudaEngine(void const*, unsigned long, nvinfer1::IPluginFactory*) () from /usr/local/lib64/third_party/libnvinfer.so.7
#9 0x00000000004b758a in ITensorRTClassifier::Internals::LoadEngine(std::string const&) ()
The odd part is that the engine was created and serialized on this same machine, yet loading it there fails with the error above. We successfully loaded the engine on another machine with the same environment, so this error has us mystified. Has anyone encountered a similar situation? Any guidance would be appreciated. Thanks in advance!
Environment
TensorRT Version: 7.2.3.4
GPU Type: Tesla P100
Nvidia Driver Version: 440.64.00
CUDA Version: 10.2
CUDNN Version: 8.1.0
Operating System + Version: CentOS 7.9.2009
Python Version (if applicable): N/A
TensorFlow Version (if applicable): N/A
PyTorch Version (if applicable): 1.9
Baremetal or Container (if container which image + tag): N/A
Relevant Files
N/A
Steps To Reproduce
- Convert ONNX model to TensorRT engine
- Serialize the engine to an engine plan file
- Deserialize the engine in another application using the deserializeCudaEngine(...) function.
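
For anyone trying to reproduce this, the file I/O around the serialize and deserialize steps can be sketched roughly as below. This is not our actual application code. `writePlanFile` and `readPlanFile` are hypothetical helper names, and the sketch only covers writing and reading the plan bytes in binary mode with a size check, so that a truncated or text-mode-mangled plan file can be ruled out before the buffer reaches deserializeCudaEngine(...). The TensorRT calls themselves appear only in comments.

```cpp
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of TensorRT): write the serialized engine
// bytes to a plan file. In TensorRT 7 the bytes would typically come from the
// IHostMemory returned by engine->serialize(), via its data() and size().
void writePlanFile(const std::string& path, const void* data, std::size_t size) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) throw std::runtime_error("cannot open plan file for writing: " + path);
    out.write(static_cast<const char*>(data), static_cast<std::streamsize>(size));
    out.flush();
    if (!out) throw std::runtime_error("failed to write plan file: " + path);
}

// Hypothetical helper (not part of TensorRT): read the whole plan file back
// in binary mode and fail loudly if fewer bytes arrive than the file holds.
std::vector<char> readPlanFile(const std::string& path) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) throw std::runtime_error("cannot open plan file: " + path);
    const std::streamoff size = in.tellg();  // opened at end, so this is the file size
    if (size <= 0) throw std::runtime_error("plan file is empty or unreadable: " + path);
    std::vector<char> blob(static_cast<std::size_t>(size));
    in.seekg(0, std::ios::beg);
    in.read(blob.data(), size);
    if (in.gcount() != size) throw std::runtime_error("short read on plan file: " + path);
    return blob;
}

// The buffer would then be handed to the runtime, matching the signature
// shown in frame #8 of the stack trace:
//   runtime->deserializeCudaEngine(blob.data(), blob.size(), nullptr);
```

If the byte count read back matches the size reported at serialization time, the plan file itself is probably intact and the loadKernel failure lies elsewhere.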