CUDA Error in TensorRT deserializeCudaEngine()

Description

TensorRT issues the following error when deserializing an engine on a Tesla P100 machine:

 /_src/rtSafe/resources.h (441) - Cuda Error in loadKernel: -1 (TensorRT internal error)
 INVALID_STATE: std::exception
 INVALID_CONFIG: Deserialize the cuda engine failed.

Here is the stack trace related to the error from gdb:

#0  0x00007fffb3ccabe0 in __cxa_throw () from /lib64/libstdc++.so.6
#1  0x00007fffe39788cf in nvinfer1::throwCudaError(char const*, char const*, int, int, char const*) () from /usr/local/lib64/third_party/libnvinfer.so.7
#2  0x00007fffe37426cb in nvinfer1::rt::ArchiveReadUtils::load(nvinfer1::rt::ReadArchive&, nvinfer1::DriverKernel&, unsigned short) () from /usr/local/lib64/third_party/libnvinfer.so.7
#3  0x00007fffe374d474 in nvinfer1::rt::ArchiveReadUtils::load(nvinfer1::rt::ReadArchive&, nvinfer1::OptionalValue<nvinfer1::rt::cuda::PointWiseV2Runner>&, unsigned short) () from /usr/local/lib64/third_party/libnvinfer.so.7
#4  0x00007fffe37596c4 in ?? () from /usr/local/lib64/third_party/libnvinfer.so.7
#5  0x00007fffe374fa44 in nvinfer1::rt::ArchiveReadUtils::load(nvinfer1::rt::ReadArchive&, nvinfer1::OptionalValue<nvinfer1::rt::Runner>&, unsigned short) () from /usr/local/lib64/third_party/libnvinfer.so.7
#6  0x00007fffe39546fc in nvinfer1::rt::SafeEngine::deserializeCoreEngine(nvinfer1::rt::CoreReadArchive&, std::vector<nvinfer1::rt::EngineLayerAttribute, std::allocator<nvinfer1::rt::EngineLayerAttribute> >&) () from /usr/local/lib64/third_party/libnvinfer.so.7
#7  0x00007fffe36faf32 in nvinfer1::rt::Engine::deserialize(void const*, unsigned long, nvinfer1::IGpuAllocator&, nvinfer1::IPluginFactory*) () from /usr/local/lib64/third_party/libnvinfer.so.7
#8  0x00007fffe3704465 in nvinfer1::Runtime::deserializeCudaEngine(void const*, unsigned long, nvinfer1::IPluginFactory*) () from /usr/local/lib64/third_party/libnvinfer.so.7
#9  0x00000000004b758a in ITensorRTClassifier::Internals::LoadEngine(std::string const&) ()

The odd thing is that the engine was created and serialized on this very machine, yet loading it there produces the error above. The same engine loads successfully on another machine with an identical environment, so this has us mystified! Has anyone encountered a similar situation? Any guidance would be appreciated. Thanks in advance!

Environment

TensorRT Version: 7.2.3.4
GPU Type: Tesla P100
Nvidia Driver Version: 440.64.00
CUDA Version: 10.2
CUDNN Version: 8.1.0
Operating System + Version: CentOS 7.9.2009
Python Version (if applicable): N/A
TensorFlow Version (if applicable): N/A
PyTorch Version (if applicable): 1.9
Baremetal or Container (if container which image + tag): N/A

Relevant Files

N/A

Steps To Reproduce

  • Convert the ONNX model to a TensorRT engine
  • Serialize the engine to an engine plan file
  • Deserialize the engine in another application using the deserializeCudaEngine(...) function (a minimal sketch of this step follows the list)
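
For context, the loading step boils down to something like the following. This is only a minimal sketch against the TensorRT 7 C++ API; the "model.plan" path, the Logger class, and the error handling are placeholders rather than the actual application code.

  #include <fstream>
  #include <iostream>
  #include <iterator>
  #include <vector>
  #include <NvInfer.h>

  // Minimal logger required by the TensorRT runtime (placeholder implementation).
  class Logger : public nvinfer1::ILogger
  {
      void log(Severity severity, const char* msg) override
      {
          if (severity <= Severity::kWARNING)
              std::cerr << msg << std::endl;
      }
  };

  int main()
  {
      // Read the serialized engine plan from disk ("model.plan" is a placeholder path).
      std::ifstream file("model.plan", std::ios::binary);
      std::vector<char> blob((std::istreambuf_iterator<char>(file)),
                             std::istreambuf_iterator<char>());

      Logger logger;
      nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
      nvinfer1::ICudaEngine* engine =
          runtime->deserializeCudaEngine(blob.data(), blob.size(), nullptr);
      if (engine == nullptr)
      {
          std::cerr << "deserializeCudaEngine failed" << std::endl;
          return 1;
      }

      // ... create an execution context and run inference ...

      engine->destroy();    // TensorRT 7 uses destroy() rather than delete
      runtime->destroy();
      return 0;
  }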

Hi @kduncan,

Could you please check the following? This error can occur for one of these reasons:

  • Are you deserializing the engine with the same TensorRT version that was used to create it?
    Generated plan files are not portable across platforms or TensorRT versions. Plans are specific to the exact GPU model they were built on (in addition to the platform and the TensorRT version) and must be rebuilt for the specific GPU if you want to run them on a different GPU.

  • Mismatched versions of libraries/dependencies. Please make sure CUDA, cuDNN, and the other libraries are installed correctly. Alternatively, you could remove the hassle of host-side dependencies by using one of our NGC Docker containers instead: NVIDIA NGC

  • Sometimes this can happen due to out-of-memory errors. You can easily check this with nvidia-smi and verify that you have plenty of memory available before trying to convert or load the model (see the diagnostic sketch after this list).
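
As a quick way to check the points above on the target machine, a small standalone check along these lines can be run before deserializing. This is only a sketch: cudaMemGetInfo, cudaGetDeviceProperties, and getInferLibVersion() are standard CUDA/TensorRT calls, but the reporting itself is illustrative.

  #include <cstdio>
  #include <cuda_runtime_api.h>
  #include <NvInfer.h>   // NV_TENSORRT_* macros and getInferLibVersion()

  int main()
  {
      // Report the GPU that the process actually sees.
      cudaDeviceProp prop{};
      cudaGetDeviceProperties(&prop, 0);
      std::printf("GPU 0: %s (compute capability %d.%d)\n",
                  prop.name, prop.major, prop.minor);

      // Report free/total device memory before attempting to load the engine.
      size_t freeBytes = 0, totalBytes = 0;
      cudaMemGetInfo(&freeBytes, &totalBytes);
      std::printf("Device memory: %zu MiB free of %zu MiB\n",
                  freeBytes >> 20, totalBytes >> 20);

      // Compare the TensorRT headers the app was built against with the
      // libnvinfer that is actually loaded at run time; a mismatch here
      // usually means the dynamic linker picked up the wrong library.
      std::printf("TensorRT headers: %d.%d.%d, loaded libnvinfer reports: %d\n",
                  NV_TENSORRT_MAJOR, NV_TENSORRT_MINOR, NV_TENSORRT_PATCH,
                  getInferLibVersion());
      return 0;
  }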

Thanks

Hello @spolisetty. Here are my responses to your suggestions.

  • Yes, I am using the same version of TRT. As I mentioned, the engine is serialized and deserialized on the same machine, so this is not the issue here.

  • I am currently investigating mismatched libraries. I’ve been checking this for the past week and truly believe it has to be the cause, but nothing is standing out so far. All of the appropriate libraries appear to be loaded. Also, using NGC containers is not an option at the moment.

  • No, memory is not an issue. This is the only GPU-heavy application running on the machine. I even tried lowering the amount of workspace memory given to TRT when building the engine, but it had no effect (a rough sketch of that build-time setting follows this list).
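
For reference, the workspace reduction was done roughly like this. This is a sketch against the TensorRT 7 builder API; the ONNX parsing is omitted and the 256 MiB cap is only illustrative, not the actual value used.

  // Sketch only: assumes `logger` is an nvinfer1::ILogger and that the network
  // is populated from the ONNX model (parsing omitted).
  nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(logger);
  nvinfer1::INetworkDefinition* network = builder->createNetworkV2(
      1U << static_cast<uint32_t>(
          nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));

  // ... parse the ONNX model into `network` with the ONNX parser here ...

  nvinfer1::IBuilderConfig* config = builder->createBuilderConfig();
  config->setMaxWorkspaceSize(256ULL << 20);   // e.g. cap the tactic workspace at 256 MiB

  nvinfer1::ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
  nvinfer1::IHostMemory* plan = engine->serialize();
  // ... write plan->data() / plan->size() out as the engine plan file ...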

Any other suggestions are welcome. Thanks in advance.

Hi @kduncan,

Please check whether the libraries are installed correctly. You could test on an NGC container to determine whether this is a dependency issue in your local setup, and then reinstall the required dependencies correctly.

Thank you.

After further investigation, it appears that this was an RPATH issue with the deployed executable. Once the RPATH was cleared before deployment, everything worked fine.
