Unable to run inference with multiple engines in parallel

Description

tl;dr:
I want to run two or more engines asynchronously so that they execute in parallel, but instead I’m observing that the second engine only runs after the first one has finished.

long version:
I have an input stream and I want to run two engines on it. One engine takes ~3 s per inference and the other takes ~100 ms. So I want to run the larger model in the background and use its output every ‘n’ frames, with the two engines running in parallel. That way I should get ~10 fps: the smaller model takes 100 ms per frame (a small increase in this time would also be fine), and every 30-40 frames I pick up the output of the larger model, which should have been processed in the background by then.
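
To make the intended scheduling concrete, here is a rough sketch of the frame loop I have in mind (bigModel/smallModel stand for instances of the engine classes shown further below; the video capture and the every-30-frames policy are only illustrative):

#include <opencv2/opencv.hpp>

int main()
{
    // The wrapper classes from this post; in my real code each engine has its own class.
    SampleInference bigModel;    // ~3 s per inference
    SampleInference smallModel;  // ~100 ms per inference
    bigModel.build();
    smallModel.build();

    cv::VideoCapture capture(0); // illustrative input stream
    cv::Mat frame, smallOut, bigOut;
    int frameIdx = 0;

    while (capture.read(frame))
    {
        if (frameIdx % 30 == 0)
        {
            if (frameIdx > 0)
                bigOut = bigModel.infer_dequeue();  // collect the result computed in the background
            bigModel.infer_enqueue(frame);          // launch the next ~3 s inference asynchronously
        }

        smallModel.infer_enqueue(frame);            // should cost ~100 ms per frame
        smallOut = smallModel.infer_dequeue();

        // ...use smallOut every frame and the latest bigOut...
        ++frameIdx;
    }
    return 0;
}
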
Both models are run using the enqueue functions (enqueue for one and enqueueV2 for the other). From the timestamps I can see that the models are not running in parallel; rather, the second model only starts working after the first one has finished. If I enqueue the larger model first and then the smaller one, the smaller model takes ~3100 ms and the larger model’s output is also ready, which can only mean that one ran after the other.
I’ve tried using cudaMallocHost instead of cudaMalloc, and cudaStreamCreateWithFlags(&stream_, cudaStreamNonBlocking) instead of cudaStreamCreate(&stream_). Below are example class functions showing how I use them (note: I have different classes for the two engines and thus different contexts).

bool SampleInference::build() // called during the init
{
    std::vector<char> trtModelStream_;
    size_t size{ 0 };

    std::ifstream file("engine1.trt", std::ios::binary);

    if (file.good())
    {
        file.seekg(0, file.end);
        size = file.tellg();
        file.seekg(0, file.beg);
        trtModelStream_.resize(size);
        file.read(trtModelStream_.data(), size);
        file.close();
    }

    IRuntime* runtime = createInferRuntime(sample::gLogger);
    
    mEngine = std::shared_ptr<nvinfer1::ICudaEngine>(runtime->deserializeCudaEngine(trtModelStream_.data(), size, nullptr), samplesCommon::InferDeleter());
    
    if (!mEngine)
    {
        return false;
    }
    
    context1 = SampleUniquePtr<nvinfer1::IExecutionContext>(mEngine->createExecutionContext());
    
    // Device buffers for the engine bindings (input: 3x512x512, output: 150x512x512)
    cudaMalloc(&buffers1[0],   3 * 512 * 512 * sizeof(float));
    cudaMalloc(&buffers1[1], 150 * 512 * 512 * sizeof(float));
    
    // Pinned host buffers used as source/destination of the async copies
    cudaMallocHost(&buffers2[0],   3 * 512 * 512 * sizeof(float));
    cudaMallocHost(&buffers2[1], 150 * 512 * 512 * sizeof(float));
    
    // Non-blocking stream so it does not implicitly synchronize with the default stream
    cudaStreamCreateWithFlags(&stream_, cudaStreamNonBlocking);

    //cudaMalloc(&buffers2[1], 150 * 512 * 512 * sizeof(float));
    //cudaMalloc(&buffers2[0], 3 * 512 * 512 * sizeof(float));
    
    //cudaStreamCreate(&stream_);
   
    return true;
}

// Enqueues asynchronous inference for one frame (preprocess, H2D copy, enqueueV2); does not wait for completion
bool SampleInference::infer_enqueue(cv::Mat &inputs_fin)
{
    //cudaStreamCreate(&stream_); //moved to build() which is called during init
    input_img = inputs_fin;
    bool status_processInput = processInput(input_img); // expected to fill the pinned host buffer buffers2[0]
    
    // Asynchronously copy the preprocessed input to the device on this class's stream
    cudaMemcpyAsync(buffers1[0], (float*)buffers2[0], 3 * 512 * 512 * sizeof(float), cudaMemcpyHostToDevice, stream_);

    // Asynchronously enqueue inference on the same stream; no synchronization here
    bool status_inference = context1->enqueueV2(buffers1, stream_, nullptr);
    
    return status_processInput && status_inference;
}

cv::Mat SampleInferenceSEGMENTER::infer_dequeue()
{
    // Asynchronously copy the output back to pinned host memory, then wait only for this stream
    cudaMemcpyAsync((float*)buffers2[1], buffers1[1], 150 * 512 * 512 * sizeof(float), cudaMemcpyDeviceToHost, stream_);
    
    cudaStreamSynchronize(stream_);
    //cudaStreamDestroy(stream_); // moved to a destroy function called in the end
    cv::Mat output_fin = processOutput(input_img);
    
    return output_fin;
}
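
For reference, this is roughly how the overlap could be checked with CUDA events; bigCtx/smallCtx, the binding arrays and the streams are placeholders for the execution contexts, buffers and streams owned by the two classes above:

#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <chrono>
#include <cstdio>

void checkOverlap(nvinfer1::IExecutionContext* bigCtx,   void** bigBindings,   cudaStream_t bigStream,
                  nvinfer1::IExecutionContext* smallCtx, void** smallBindings, cudaStream_t smallStream)
{
    cudaEvent_t bigStart, bigStop, smallStart, smallStop;
    cudaEventCreate(&bigStart);   cudaEventCreate(&bigStop);
    cudaEventCreate(&smallStart); cudaEventCreate(&smallStop);

    auto wallStart = std::chrono::steady_clock::now();

    // Launch both engines back to back on their own streams before synchronizing anything
    cudaEventRecord(bigStart, bigStream);
    bigCtx->enqueueV2(bigBindings, bigStream, nullptr);
    cudaEventRecord(bigStop, bigStream);

    cudaEventRecord(smallStart, smallStream);
    smallCtx->enqueueV2(smallBindings, smallStream, nullptr);
    cudaEventRecord(smallStop, smallStream);

    cudaStreamSynchronize(bigStream);
    cudaStreamSynchronize(smallStream);

    auto wallEnd = std::chrono::steady_clock::now();

    float bigMs = 0.f, smallMs = 0.f;
    cudaEventElapsedTime(&bigMs, bigStart, bigStop);
    cudaEventElapsedTime(&smallMs, smallStart, smallStop);
    float wallMs = std::chrono::duration<float, std::milli>(wallEnd - wallStart).count();

    // If wallMs is close to bigMs + smallMs the engines serialized;
    // if it is close to max(bigMs, smallMs) they actually overlapped.
    printf("big: %.1f ms, small: %.1f ms, wall: %.1f ms\n", bigMs, smallMs, wallMs);

    cudaEventDestroy(bigStart);   cudaEventDestroy(bigStop);
    cudaEventDestroy(smallStart); cudaEventDestroy(smallStop);
}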

Environment

TensorRT Version: 7.1.3.0
GPU Type: Jetson Nano (NVIDIA Tegra X1 (nvgpu)/integrated)
Nvidia Driver Version: L4T 32.4.4 [ JetPack 4.4.1 ]
CUDA Version: 10.2.89
CUDNN Version: 8.0.0.180
Operating System + Version: Ubuntu 18.04.6 LTS
OpenCV Version: 4.4.0

Steps To Reproduce

Run the infer_enqueue function for the larger engine, then for the smaller engine, and then run the infer_dequeue function for the smaller engine, as in the snippet below.
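
In code (bigModel/smallModel are illustrative instances of the two engine classes above):

bigModel.infer_enqueue(frame);              // ~3 s engine, enqueued first on its own stream
smallModel.infer_enqueue(frame);            // ~100 ms engine, enqueued right after on a second stream
cv::Mat out = smallModel.infer_dequeue();   // returns only after ~3100 ms, and by then the larger
                                            // engine's output is also ready, i.e. they ran back to back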

Hi,

The below links might be useful for you.
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-803/best-practices/index.html#thread-safety

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html

For multi-threading/streaming, we suggest using DeepStream or Triton.

For more details, we recommend you raise the query in the DeepStream forum

or

raise the query in the Triton Inference Server GitHub issues section.

Thanks!

For other reasons, I can’t use DeepStream or Triton.

In the link you shared, it states that:

The TensorRT builder may only be used by one thread at a time. If you need to run multiple builds simultaneously, you will need to create multiple builders.
The TensorRT runtime can be used by multiple threads simultaneously, so long as each object uses a different execution context.

Since I have different classes for the two engines, I already have two different contexts.
Or does this mean that the IRuntime* runtime = createInferRuntime(sample::gLogger); should be shared between the two classes/engines, with each engine keeping its own context? Something like the sketch below?
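
A rough sketch of what I mean, assuming one shared runtime (the helper names TrtDestroy/EngineBundle/makeBundle and the engine file names are only placeholders):

#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <fstream>
#include <memory>
#include <vector>

// TensorRT 7 objects are released via destroy(), so use a custom deleter
struct TrtDestroy
{
    template <typename T>
    void operator()(T* obj) const { if (obj) obj->destroy(); }
};

static std::vector<char> readEngineFile(const char* path)
{
    std::ifstream file(path, std::ios::binary | std::ios::ate);
    if (!file)
        return {};
    std::vector<char> blob(file.tellg());
    file.seekg(0, std::ios::beg);
    file.read(blob.data(), blob.size());
    return blob;
}

// One engine + its own execution context + its own non-blocking stream
struct EngineBundle
{
    std::shared_ptr<nvinfer1::ICudaEngine> engine;
    std::unique_ptr<nvinfer1::IExecutionContext, TrtDestroy> context;
    cudaStream_t stream{};
};

EngineBundle makeBundle(nvinfer1::IRuntime& runtime, const char* enginePath)
{
    EngineBundle b;
    std::vector<char> blob = readEngineFile(enginePath);
    b.engine = std::shared_ptr<nvinfer1::ICudaEngine>(
        runtime.deserializeCudaEngine(blob.data(), blob.size(), nullptr), TrtDestroy());
    b.context.reset(b.engine->createExecutionContext());
    cudaStreamCreateWithFlags(&b.stream, cudaStreamNonBlocking);
    return b;
}

// Usage: a single IRuntime shared by both engines/classes, e.g.
//   nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(sample::gLogger);
//   EngineBundle big   = makeBundle(*runtime, "engine1.trt");
//   EngineBundle small = makeBundle(*runtime, "engine2.trt");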

Any help is greatly appreciated.

Hi,

Could you please try the latest TensorRT version, 8.4?

Thank you.