Unable to run inference with multiple engines in parallel

Description

tl;dr:
I want to run two or more engines asynchronously so that they execute in parallel, but instead I’m observing that the second engine only runs after the first one has finished.

long version:
I have an input stream and I want to run two engines on it. One engine takes ~3 s per inference and the other takes ~100 ms. So I want to run the larger model in the background and use its output every ‘n’ frames, with the two engines running in parallel. That way I should get ~10 fps: the smaller model takes 100 ms per frame (a small increase in this time would also be fine), and every 30-40 frames I pick up the output of the larger model, which should have been processed in the background by then.
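
To make the intended scheduling concrete, here is a rough sketch of the frame loop I have in mind (bigModel/smallModel stand for instances of the engine classes shown further below; the video capture and the every-30-frames policy are only illustrative):

#include <opencv2/opencv.hpp>

int main()
{
    // The wrapper classes from this post; in my real code each engine has its own class.
    SampleInference bigModel;    // ~3 s per inference
    SampleInference smallModel;  // ~100 ms per inference
    bigModel.build();
    smallModel.build();

    cv::VideoCapture capture(0); // illustrative input stream
    cv::Mat frame, smallOut, bigOut;
    int frameIdx = 0;

    while (capture.read(frame))
    {
        if (frameIdx % 30 == 0)
        {
            if (frameIdx > 0)
                bigOut = bigModel.infer_dequeue();  // collect the result computed in the background
            bigModel.infer_enqueue(frame);          // launch the next ~3 s inference asynchronously
        }

        smallModel.infer_enqueue(frame);            // should cost ~100 ms per frame
        smallOut = smallModel.infer_dequeue();

        // ...use smallOut every frame and the latest bigOut...
        ++frameIdx;
    }
    return 0;
}
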
Both models are run using the enqueue functions (enqueue for one and enqueueV2 for the other). From the timestamps I can see that the models are not running in parallel; rather, the second model only starts working after the first one has finished. If I enqueue the larger model first and then the smaller one, the smaller model takes ~3100 ms and the larger model’s output is also ready, which can only mean that one ran after the other.
I’ve tried using cudaMallocHost instead of cudaMalloc, and cudaStreamCreateWithFlags(&stream_, cudaStreamNonBlocking) instead of cudaStreamCreate(&stream_). Below are example class functions showing how I use them (note: I have different classes for the two engines and thus different contexts).

bool SampleInference::build() // called during the init
{
    std::vector<char> trtModelStream_;
    size_t size{ 0 };

    std::ifstream file("engine1.trt", std::ios::binary);

    if (file.good())
    {
        file.seekg(0, file.end);
        size = file.tellg();
        file.seekg(0, file.beg);
        trtModelStream_.resize(size);
        file.read(trtModelStream_.data(), size);
        file.close();
    }

    IRuntime* runtime = createInferRuntime(sample::gLogger);
    
    mEngine = std::shared_ptr<nvinfer1::ICudaEngine>(runtime->deserializeCudaEngine(trtModelStream_.data(), size, nullptr), samplesCommon::InferDeleter());
    
    if (!mEngine)
    {
        return false;
    }
    
    context1 = SampleUniquePtr<nvinfer1::IExecutionContext>(mEngine->createExecutionContext());
    
    // Device buffers for the engine bindings (input: 3x512x512, output: 150x512x512)
    cudaMalloc(&buffers1[0],   3 * 512 * 512 * sizeof(float));
    cudaMalloc(&buffers1[1], 150 * 512 * 512 * sizeof(float));
    
    // Pinned host buffers used as source/destination of the async copies
    cudaMallocHost(&buffers2[0],   3 * 512 * 512 * sizeof(float));
    cudaMallocHost(&buffers2[1], 150 * 512 * 512 * sizeof(float));
    
    // Non-blocking stream so it does not implicitly synchronize with the default stream
    cudaStreamCreateWithFlags(&stream_, cudaStreamNonBlocking);

    //cudaMalloc(&buffers2[1], 150 * 512 * 512 * sizeof(float));
    //cudaMalloc(&buffers2[0], 3 * 512 * 512 * sizeof(float));
    
    //cudaStreamCreate(&stream_);
   
    return true;
}

// Enqueues asynchronous inference for one frame (preprocess, H2D copy, enqueueV2); does not wait for completion
bool SampleInference::infer_enqueue(cv::Mat &inputs_fin)
{
    //cudaStreamCreate(&stream_); //moved to build() which is called during init
    input_img = inputs_fin;
    bool status_processInput = processInput(input_img); // expected to fill the pinned host buffer buffers2[0]
    
    // Asynchronously copy the preprocessed input to the device on this class's stream
    cudaMemcpyAsync(buffers1[0], (float*)buffers2[0], 3 * 512 * 512 * sizeof(float), cudaMemcpyHostToDevice, stream_);

    // Asynchronously enqueue inference on the same stream; no synchronization here
    bool status_inference = context1->enqueueV2(buffers1, stream_, nullptr);
    
    return status_processInput && status_inference;
}

cv::Mat SampleInferenceSEGMENTER::infer_dequeue()
{
    // Asynchronously copy the output back to pinned host memory, then wait only for this stream
    cudaMemcpyAsync((float*)buffers2[1], buffers1[1], 150 * 512 * 512 * sizeof(float), cudaMemcpyDeviceToHost, stream_);
    
    cudaStreamSynchronize(stream_);
    //cudaStreamDestroy(stream_); // moved to a destroy function called in the end
    cv::Mat output_fin = processOutput(input_img);
    
    return output_fin;
}
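
For reference, this is roughly how the overlap could be checked with CUDA events; bigCtx/smallCtx, the binding arrays and the streams are placeholders for the execution contexts, buffers and streams owned by the two classes above:

#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <chrono>
#include <cstdio>

void checkOverlap(nvinfer1::IExecutionContext* bigCtx,   void** bigBindings,   cudaStream_t bigStream,
                  nvinfer1::IExecutionContext* smallCtx, void** smallBindings, cudaStream_t smallStream)
{
    cudaEvent_t bigStart, bigStop, smallStart, smallStop;
    cudaEventCreate(&bigStart);   cudaEventCreate(&bigStop);
    cudaEventCreate(&smallStart); cudaEventCreate(&smallStop);

    auto wallStart = std::chrono::steady_clock::now();

    // Launch both engines back to back on their own streams before synchronizing anything
    cudaEventRecord(bigStart, bigStream);
    bigCtx->enqueueV2(bigBindings, bigStream, nullptr);
    cudaEventRecord(bigStop, bigStream);

    cudaEventRecord(smallStart, smallStream);
    smallCtx->enqueueV2(smallBindings, smallStream, nullptr);
    cudaEventRecord(smallStop, smallStream);

    cudaStreamSynchronize(bigStream);
    cudaStreamSynchronize(smallStream);

    auto wallEnd = std::chrono::steady_clock::now();

    float bigMs = 0.f, smallMs = 0.f;
    cudaEventElapsedTime(&bigMs, bigStart, bigStop);
    cudaEventElapsedTime(&smallMs, smallStart, smallStop);
    float wallMs = std::chrono::duration<float, std::milli>(wallEnd - wallStart).count();

    // If wallMs is close to bigMs + smallMs the engines serialized;
    // if it is close to max(bigMs, smallMs) they actually overlapped.
    printf("big: %.1f ms, small: %.1f ms, wall: %.1f ms\n", bigMs, smallMs, wallMs);

    cudaEventDestroy(bigStart);   cudaEventDestroy(bigStop);
    cudaEventDestroy(smallStart); cudaEventDestroy(smallStop);
}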

Environment

TensorRT Version: 7.1.3.0
GPU Type: Jetson Nano (NVIDIA Tegra X1 (nvgpu)/integrated)
Nvidia Driver Version: L4T 32.4.4 [ JetPack 4.4.1 ]
CUDA Version: 10.2.89
CUDNN Version: 8.0.0.180
Operating System + Version: Ubuntu 18.04.6 LTS
OpenCV Version: 4.4.0

Steps To Reproduce

Run the infer_enqueue function for the larger engine, then for the smaller engine, and then run the infer_dequeue function for the smaller engine, as in the snippet below.
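
In code (bigModel/smallModel are illustrative instances of the two engine classes above):

bigModel.infer_enqueue(frame);              // ~3 s engine, enqueued first on its own stream
smallModel.infer_enqueue(frame);            // ~100 ms engine, enqueued right after on a second stream
cv::Mat out = smallModel.infer_dequeue();   // returns only after ~3100 ms, and by then the larger
                                            // engine's output is also ready, i.e. they ran back to back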

Hi,

The below links might be useful for you.
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-803/best-practices/index.html#thread-safety

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html

For multi-threading/streaming, we suggest using DeepStream or Triton.

For more details, we recommend you raise the query in the DeepStream forum

or

raise the query in the Triton Inference Server GitHub issues section.

Thanks!

For other reasons, I can’t use DeepStream or Triton.

In the link you shared, it states that:

The TensorRT builder may only be used by one thread at a time. If you need to run multiple builds simultaneously, you will need to create multiple builders.
The TensorRT runtime can be used by multiple threads simultaneously, so long as each object uses a different execution context.

Since I have different classes for the two engines, I already have two different contexts.
Or does this mean that the IRuntime* runtime = createInferRuntime(sample::gLogger); should be shared between the two classes/engines, with each engine keeping its own context? Something like the sketch below?
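
A rough sketch of what I mean, assuming one shared runtime (the helper names TrtDestroy/EngineBundle/makeBundle and the engine file names are only placeholders):

#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <fstream>
#include <memory>
#include <vector>

// TensorRT 7 objects are released via destroy(), so use a custom deleter
struct TrtDestroy
{
    template <typename T>
    void operator()(T* obj) const { if (obj) obj->destroy(); }
};

static std::vector<char> readEngineFile(const char* path)
{
    std::ifstream file(path, std::ios::binary | std::ios::ate);
    if (!file)
        return {};
    std::vector<char> blob(file.tellg());
    file.seekg(0, std::ios::beg);
    file.read(blob.data(), blob.size());
    return blob;
}

// One engine + its own execution context + its own non-blocking stream
struct EngineBundle
{
    std::shared_ptr<nvinfer1::ICudaEngine> engine;
    std::unique_ptr<nvinfer1::IExecutionContext, TrtDestroy> context;
    cudaStream_t stream{};
};

EngineBundle makeBundle(nvinfer1::IRuntime& runtime, const char* enginePath)
{
    EngineBundle b;
    std::vector<char> blob = readEngineFile(enginePath);
    b.engine = std::shared_ptr<nvinfer1::ICudaEngine>(
        runtime.deserializeCudaEngine(blob.data(), blob.size(), nullptr), TrtDestroy());
    b.context.reset(b.engine->createExecutionContext());
    cudaStreamCreateWithFlags(&b.stream, cudaStreamNonBlocking);
    return b;
}

// Usage: a single IRuntime shared by both engines/classes, e.g.
//   nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(sample::gLogger);
//   EngineBundle big   = makeBundle(*runtime, "engine1.trt");
//   EngineBundle small = makeBundle(*runtime, "engine2.trt");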

Any help is greatly appreciated.

Hi,

Could you please try the latest TensorRT version, 8.4?

Thank you.