Terminate called after throwing an instance of 'nvinfer1::MyelinError'


Occasionally, when destroy() is called on an ICudaEngine, an nvinfer1::MyelinError exception is thrown. The method is marked noexcept, so this results in std::terminate being called, with no way to catch and handle the exception. The exception contains the following error:

myelin/myelinGraphContext.h (40) - Myelin Error in ~MyelinGraphContext: 3 ()


TensorRT Version: Occurs in both and
GPU Type: GTX 1060
Nvidia Driver Version: 460.39
CUDA Version: 11.1
CUDNN Version: 8.0
Operating System + Version: Ubuntu 18.04
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files


Steps To Reproduce

  • Create and destroy ICudaEngine in a loop

Stack trace:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f7338f25921 in __GI_abort () at abort.c:79
#2  0x00007f733957a957 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007f7339580ae6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f733957fb49 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007f73395804b8 in __gxx_personality_v0 () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f73392e6573 in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#7  0x00007f73392e6df5 in _Unwind_Resume () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#8  0x00007f72e543142d in nvinfer1::throwMyelinError(char const*, char const*, int, int, char const*) () from /home/myles/core-cmake/cmake-build-debug/lib/../lib/libnvinfer.so.7
#9  0x00007f72e54279c1 in nvinfer1::rt::MyelinRunner::~MyelinRunner() () from /home/myles/core-cmake/cmake-build-debug/lib/../lib/libnvinfer.so.7
#10 0x00007f72e54279f9 in nvinfer1::rt::MyelinRunner::~MyelinRunner() () from /home/myles/core-cmake/cmake-build-debug/lib/../lib/libnvinfer.so.7
#11 0x00007f72e53b5aa6 in nvinfer1::rt::SafeEngine::~SafeEngine() () from /home/myles/core-cmake/cmake-build-debug/lib/../lib/libnvinfer.so.7
#12 0x00007f72e510ca6b in nvinfer1::rt::Engine::~Engine() () from /home/myles/core-cmake/cmake-build-debug/lib/../lib/libnvinfer.so.7
#13 0x00007f72e510cb99 in nvinfer1::rt::Engine::~Engine() () from /home/myles/core-cmake/cmake-build-debug/lib/../lib/libnvinfer.so.7

Hi @myles.inglis,

Could you please share repro scripts for the issue so we can assist better.

Thank you.

I’ve been doing some more testing on this, and it appears to be a thread-safety issue with the IExecutionContexts. I’ve been trying to create a minimal example that you can run, but it has proven difficult without our entire integration code and internal models.

It seems that running inference on multiple execution contexts from the same engine is thread safe, but creating (and possibly destroying?) execution contexts is not. For example:

engine.reset(runtime->deserializeCudaEngine(engine_data.data(), engine_data.size()));
std::vector<std::thread> threads;
for (int j = 0; j < 8; j++) {
  // Each worker creates its own execution context, concurrently with the others
  threads.emplace_back([&engine]() {
    auto exec_context = TRTPointer<nvinfer1::IExecutionContext>(engine->createExecutionContext());
    // Do inference
  });
}
for (auto& t : threads) t.join();

This causes assertions and calls to std::terminate, whereas the following does not:

engine.reset(runtime->deserializeCudaEngine(engine_data.data(), engine_data.size()));
std::vector<std::thread> threads;
for (int j = 0; j < 8; j++) {
  // Contexts are created sequentially on the main thread, then moved into the workers
  auto exec_context = TRTPointer<nvinfer1::IExecutionContext>(engine->createExecutionContext());
  threads.emplace_back([exec_context = std::move(exec_context)]() {
    // Do inference
  });
}
for (auto& t : threads) t.join();

Is this expected?

Regardless, it would be preferable if the exceptions were caught inside the noexcept methods, or not thrown at all, to avoid crashing the host application.
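If only creation/destruction is racy, one possible workaround is to serialize those calls behind a mutex while letting inference run concurrently. Below is a minimal standalone sketch of that pattern; MockContext is a hypothetical stand-in for nvinfer1::IExecutionContext (TensorRT itself is not required to run it), and the assumption that creation/destruction needs the lock is based only on the observations above, not on documented behavior.

#include <atomic>
#include <cassert>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for an execution context; in real code this would
// come from engine->createExecutionContext(), which we assume (per the
// observations above) is not safe to call from multiple threads at once.
struct MockContext { int id; };

std::mutex g_create_mutex;      // serializes creation and destruction only
std::atomic<int> g_created{0};
std::atomic<int> g_inferences{0};

MockContext* create_context_serialized() {
    std::lock_guard<std::mutex> lock(g_create_mutex);
    return new MockContext{g_created.fetch_add(1)};
}

void destroy_context_serialized(MockContext* ctx) {
    std::lock_guard<std::mutex> lock(g_create_mutex);
    delete ctx;
}

int main() {
    std::vector<std::thread> threads;
    for (int j = 0; j < 8; ++j) {
        threads.emplace_back([]() {
            MockContext* ctx = create_context_serialized();
            g_inferences.fetch_add(1);  // stands in for enqueueV2()/executeV2()
            destroy_context_serialized(ctx);
        });
    }
    for (auto& t : threads) t.join();
    assert(g_created.load() == 8 && g_inferences.load() == 8);
    std::printf("8 contexts created and used without racing on creation\n");
    return 0;
}

The lock is held only around the create/destroy calls, so inference itself still runs fully in parallel.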

Hi @myles.inglis,

We recommend you refer to the Best Practices For TensorRT Performance document.

2.4. Thread Safety

The TensorRT builder may only be used by one thread at a time. If you need to run multiple builds simultaneously, you will need to create multiple builders. The TensorRT runtime can be used by multiple threads simultaneously, so long as each object uses a different execution context.
Note: Plugins are shared at the engine level, not the execution context level, and thus plugins which may be used simultaneously by multiple threads need to manage their resources in a thread-safe manner. This is however not required for plugins based on IPluginV2Ext and derivative interfaces since we clone these plugins when the ExecutionContext is created.

The TensorRT library pointer to the logger is a singleton within the library. If using multiple builder or runtime objects, use the same logger, and ensure that it is thread-safe.

Thank you.

Hi, yes, we are already following these best practices: we only do one build at a time, and we use a different execution context for each thread. My question is not answered in that document, though.

The issue is that constructing these execution contexts appears not to be thread safe, in my experience. Our workaround is to construct all the execution contexts up front and pass them into the worker threads. However, nothing in that best-practices document, or anywhere else in the documentation, states that construction is not thread safe.
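The create-up-front workaround can be sketched as a small standalone program. MockContext below is a hypothetical stand-in for nvinfer1::IExecutionContext so the pattern can be shown (and run) without TensorRT; in real code each context would come from engine->createExecutionContext() on the main thread before any workers start.

#include <atomic>
#include <cassert>
#include <cstdio>
#include <memory>
#include <thread>
#include <vector>

// Hypothetical stand-in for nvinfer1::IExecutionContext.
struct MockContext { int id; };

std::atomic<int> g_used{0};

int main() {
    const int kThreads = 8;

    // 1) Create every execution context up front, on a single thread,
    //    so no two creation calls ever overlap.
    std::vector<std::unique_ptr<MockContext>> contexts;
    for (int j = 0; j < kThreads; ++j)
        contexts.emplace_back(new MockContext{j});

    // 2) Move exactly one context into each worker thread; only
    //    inference runs concurrently, which is documented as safe.
    std::vector<std::thread> threads;
    for (auto& ctx : contexts) {
        threads.emplace_back([c = std::move(ctx)]() {
            g_used.fetch_add(1);  // stands in for c->executeV2(...)
        });
    }
    for (auto& t : threads) t.join();

    assert(g_used.load() == kThreads);
    std::printf("%d workers each used a pre-built context\n", g_used.load());
    return 0;
}

Because each context is owned by exactly one thread after the move, no synchronization is needed during inference at all.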