Is it safe in TensorRT to create the engine & context in one thread and execute in another thread?

Description

I am building a web application that visualizes neural-network inference results, using C++ / TensorRT / httplib.

When a user opens a project, the server calls cudaSetDevice(assigned GPU id), deserializes a TensorRT engine, and creates the corresponding context, all in some arbitrary thread.

When a user queries an image's inference result, the server looks up the corresponding context's pointer, runs the inference, and sends the result back, in some other arbitrary thread. std::mutex and std::lock are used to make sure there are NO CONCURRENT CALLS to context->execute(), context->enqueue(), etc.
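To make the pattern above concrete, here is a minimal sketch of the query path. It is not TensorRT code: ProjectSession, inferOnSession, setCurrentDevice and g_currentDevice are hypothetical names, the thread_local variable only stands in for the CUDA runtime's current device (which is tracked per host thread), and runInference is a placeholder for the real context->enqueue() / context->execute() call. The point it illustrates is that the thread doing the inference selects the device again itself before taking the lock on the context.

```cpp
#include <cassert>
#include <functional>
#include <mutex>

// Stand-in for the CUDA runtime's per-thread "current device".
// Assumption: in the real server, setCurrentDevice(id) would be cudaSetDevice(id).
thread_local int g_currentDevice = 0;
inline void setCurrentDevice(int id) { g_currentDevice = id; }

// Hypothetical per-project state: the GPU the engine was deserialized on,
// plus a mutex that serializes every use of the single execution context.
struct ProjectSession {
    int deviceId = 0;
    std::mutex contextMutex;
    // Placeholder for the real context->enqueue()/execute() call.
    // It receives the device that is current in the calling thread.
    std::function<bool(int currentDevice)> runInference;
};

// May be called from any request thread.
inline bool inferOnSession(ProjectSession& s) {
    assert(s.runInference && "runInference must be set before use");
    // The current device is a per-thread setting, so a thread that did not
    // create the engine has to select the engine's device on its own.
    setCurrentDevice(s.deviceId);
    // Only one thread at a time may use this context.
    std::lock_guard<std::mutex> lock(s.contextMutex);
    return s.runInference(g_currentDevice);
}
```

With one ProjectSession per opened project, requests for the same project are serialized by that session's mutex, while sessions on different GPUs do not block each other. Whether this is sufficient for a given TensorRT version should be checked against its thread-safety documentation.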

The above process works fine on a single-GPU server.
But I am worried that a multi-GPU environment, or some edge case, may cause problems.

Is it safe in TensorRT to create the engine & context in one thread and execute in another thread?

Environment

TensorRT Version: 8.4
GPU Type: RTX 2080 Ti
Nvidia Driver Version: 510.47.03
CUDA Version: 11.6
CUDNN Version: 8.3
Operating System + Version: Ubuntu 18.04

Hi,

The links below might be useful for you.
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-803/best-practices/index.html#thread-safety

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html

For multi-threading/streaming, we suggest using DeepStream or Triton.

For more details, we recommend raising the query in the DeepStream forum

or

in the issues section of the Triton Inference Server GitHub repository.

Thanks!