Description
I am building a web app for visualizing neural-network inference results, using C++ / TensorRT / httplib.
When a user opens a project, the server calls cudaSetDevice(assigned GPU id), deserializes the corresponding TensorRT engine, and creates an execution context. This happens in whatever worker thread httplib happens to dispatch the request to.
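Roughly, the project-open path looks like the sketch below (names such as `ProjectContext` and `openProject` are just illustrative, and error checking / cleanup are omitted):

```cpp
// Simplified sketch of the project-open path. Names like ProjectContext and
// openProject are illustrative; error checking and cleanup are omitted.
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <mutex>
#include <string>
#include <vector>

class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::printf("%s\n", msg);
    }
};

// Everything the server keeps per opened project.
struct ProjectContext {
    int deviceId = 0;
    nvinfer1::ICudaEngine* engine = nullptr;
    nvinfer1::IExecutionContext* context = nullptr;
    std::mutex mtx;  // serializes execute()/enqueue() calls on this context
};

// Runs in whichever httplib worker thread receives the "open project" request.
ProjectContext* openProject(const std::string& planPath, int deviceId, Logger& logger) {
    cudaSetDevice(deviceId);  // bind this thread to the assigned GPU

    std::ifstream file(planPath, std::ios::binary);
    std::vector<char> plan((std::istreambuf_iterator<char>(file)),
                           std::istreambuf_iterator<char>());

    auto* runtime = nvinfer1::createInferRuntime(logger);
    auto* proj = new ProjectContext;
    proj->deviceId = deviceId;
    proj->engine = runtime->deserializeCudaEngine(plan.data(), plan.size());
    proj->context = proj->engine->createExecutionContext();
    return proj;
}
```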
When a user queries an image's inference result, the server looks up the corresponding context's pointer, runs the inference, and sends the result back, again in whatever worker thread handles that request. std::mutex and std::lock are used to make sure there are NO CONCURRENT CALLS to context->execute(), context->enqueue(), etc.
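The per-image query path is roughly the following (again just a sketch, reusing the `ProjectContext` from the previous snippet; how the input/output device buffers are prepared is left out):

```cpp
// Simplified sketch of the per-image query path, using the ProjectContext
// from the previous snippet. Binding setup and result handling are left out.
void runInference(ProjectContext* proj, void* const* deviceBindings, cudaStream_t stream) {
    // I assume each worker thread also has to select the project's GPU before
    // touching its context; this is part of what I am unsure about on multi-GPU.
    cudaSetDevice(proj->deviceId);

    std::lock_guard<std::mutex> guard(proj->mtx);  // no concurrent enqueue() on one context
    proj->context->enqueueV2(deviceBindings, stream, nullptr);
    cudaStreamSynchronize(stream);  // wait for the result before sending the HTTP response
}
```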
The above process works fine on a single-GPU server.
But I am worried that a multi-GPU environment, or some edge case, may cause problems.
Is it safe in TensorRT to create the engine & context in one thread and execute in another thread?
Environment
TensorRT Version: 8.4
GPU Type: RTX 2080 Ti
Nvidia Driver Version: 510.47.03
CUDA Version: 11.6
CUDNN Version: 8.3
Operating System + Version: Ubuntu 18.04