Running multiple threads with different engines in TensorRT


I am trying to run TensorRT in multiple threads, with multiple engines, on the same GPU. I have the following architecture:

  1. An INT8 engine pre-built with trtexec from a YOLOv7 ONNX model. trtexec passes successfully.
  2. A main thread that reads this model and creates an array of Engine objects. Each object has its own ICudaEngine, IExecutionContext, and non-blocking CUDA stream. The main thread initializes these objects and keeps them in the array.
  3. After initialization, parallel calls are made to these Engine objects, each with an engine ID to use. Each call does an async memory copy, calls enqueueV2 and cudaStreamSynchronize, and returns the result to the main thread.

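For reference, the threading structure described in the steps above can be sketched as follows. This is a minimal illustration of the design only: the real TensorRT and CUDA calls are shown as comments, and the `Engine` struct, `infer` signature, and the stand-in arithmetic result are placeholders, not the actual code from this setup.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Placeholder for the per-thread resources the post describes.
// In the real code these members would be nvinfer1::ICudaEngine*,
// nvinfer1::IExecutionContext*, and a cudaStream_t created with
// cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking).
struct Engine {
    int id;
    // ICudaEngine*       engine;   // deserialized once in the main thread
    // IExecutionContext* context;  // one per Engine object, never shared
    // cudaStream_t       stream;   // non-blocking stream per Engine object

    int infer(int input) {
        // Real sequence, per the post:
        //   cudaMemcpyAsync(devIn, hostIn, inSize, cudaMemcpyHostToDevice, stream);
        //   context->enqueueV2(bindings, stream, nullptr);
        //   cudaMemcpyAsync(hostOut, devOut, outSize, cudaMemcpyDeviceToHost, stream);
        //   cudaStreamSynchronize(stream);
        return input * 2 + id;  // stand-in for the real result
    }
};

// Each worker thread uses exactly one Engine object (no sharing),
// and the main thread joins all workers to collect the results.
std::vector<int> runParallel(std::vector<Engine>& engines, int input) {
    std::vector<int> results(engines.size());
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < engines.size(); ++i) {
        workers.emplace_back([&, i] { results[i] = engines[i].infer(input); });
    }
    for (auto& w : workers) w.join();
    return results;
}
```
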
When I run this setup on an NVIDIA MX450, it behaves like a serial operation. I can see two streams running concurrently (only the copies, not the execution) and returning results correctly.

But when we run it on an RTX A2000, it gives the following error most of the time:
1: [cudaDriverHelpers.cpp::nvinfer1::CuDeleter<struct CUmod_st *,&enum cudaError_enum __cdecl nvinfer1::cuModuleUnloadWrapper(struct CUmod_st *)>::operator ()::29] Error Code 1: Cuda Driver (an illegal instruction was encountered)
followed by
CUDA initialization failure with error: 715
In some attempts I was able to get it running past this error, but then the results fluctuate (giving wrong values) for some time before settling on correct ones. I can see that CUDA utilization starts low while the results are fluctuating between threads, then reaches its maximum level, and from that point the results are stable and correct.

When I restrict the calls to a single thread, it works as expected at maximum speed, and no fluctuation is observed.
Further, when we run the two threads with two different engines serially, the results are correct and it takes twice the time, as expected.
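
As a diagnostic, the working serialized two-thread case can also be reproduced by guarding the inference call with a shared mutex, so the threads stay alive but only one enqueue/synchronize sequence runs at a time. This is a sketch under the assumption that the per-engine inference wrapper can be wrapped this way; `guardedInfer` and the stand-in arithmetic are placeholders, not actual code from this setup.

```cpp
#include <mutex>

// One process-wide lock serializing the enqueueV2/cudaStreamSynchronize
// sequence across threads; everything outside the lock still runs in
// parallel. If the crash disappears with this guard in place, the failure
// is tied to concurrent enqueues rather than to the engines themselves.
std::mutex inferMutex;

int guardedInfer(int engineId, int input) {
    std::lock_guard<std::mutex> lock(inferMutex);
    // Real sequence: cudaMemcpyAsync -> enqueueV2 -> cudaMemcpyAsync
    //                -> cudaStreamSynchronize, all on this engine's stream.
    return input * 2 + engineId;  // stand-in for the real result
}
```

With the guard, throughput drops to the serial rate (the "twice the time" case above), but it gives a clean correctness baseline to compare the parallel path against.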

Can someone help resolve this issue?

Environment

TensorRT Version : 8.4.3
GPU Type : RTX A2000 6 GB
Nvidia Driver Version: 527.27
CUDA Version: 11.8
CUDNN Version: 8.6
Operating System + Version: Windows 10 64 bit
Python Version (if applicable): –
TensorFlow Version (if applicable): –
PyTorch Version (if applicable): –
Baremetal or Container (if container which image + tag): –

Relevant Files


Steps To Reproduce


Hi, please refer to the links below to perform inference in INT8.


I don't think the INT8 conversion is the problem; otherwise it would not work in a single thread either. The same problem also exists with FP32.


The links below might be useful for you.

For multi-threading/streaming, we suggest using DeepStream or Triton.

For more details, we recommend raising the query in the DeepStream forum, or in the Triton Inference Server GitHub issues section.