Description
tl;dr:
I want to run two or more engines asynchronously so that they execute in parallel, but instead I’m observing that the second engine only runs after the first one has finished.
long version:
I have an input stream and I want to run two engines on it for processing. One engine takes ~3 s per inference and the other takes ~100 ms. So I wanted to run the larger model in the background and use its output once every ‘n’ frames, with the two engines running in parallel: the smaller model should keep giving me ~10 fps at ~100 ms per frame (a small increase in this time would also be fine), and after every 30–40 frames I would pick up the output of the larger model, which should have been processed in the background by then (see the sketch below).
Both models are launched with an enqueue call (enqueue for one, enqueueV2 for the other). From the timestamps I can see that the models are not running in parallel; rather, the second model only starts once the first one has finished. If I enqueue the larger model first and then the smaller one, the smaller model takes ~3100 ms and the larger model’s output is ready as well, which can only mean that one runs after the other.
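For clarity, here is a minimal sketch of the interleaving I am aiming for (big, small, frame, and frames are illustrative names, not my actual code):

```
big.infer_enqueue(frame);                 // ~3 s engine, kicked off in the background
for (int i = 0; i < 30; ++i)              // meanwhile the ~100 ms engine keeps up ~10 fps
{
    small.infer_enqueue(frames[i]);
    cv::Mat seg = small.infer_dequeue();  // per-frame result from the small engine
}
cv::Mat out = big.infer_dequeue();        // collect the background result after ~30 frames
```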
I’ve tried using cudaMallocHost instead of cudaMalloc, and cudaStreamCreateWithFlags(&stream_, cudaStreamNonBlocking) instead of cudaStreamCreate(&stream_). Below are example class functions showing how I use these. (Note: I have separate classes for the two engines, and therefore separate execution contexts.)
```
bool SampleInference::build() // called during init
{
    // Read the serialized engine from disk
    std::vector<char> trtModelStream_;
    size_t size{0};
    std::ifstream file("engine1.trt", std::ios::binary);
    if (file.good())
    {
        file.seekg(0, file.end);
        size = file.tellg();
        file.seekg(0, file.beg);
        trtModelStream_.resize(size);
        file.read(trtModelStream_.data(), size);
        file.close();
    }

    // Deserialize the engine and create an execution context
    IRuntime* runtime = createInferRuntime(sample::gLogger);
    mEngine = std::shared_ptr<nvinfer1::ICudaEngine>(
        runtime->deserializeCudaEngine(trtModelStream_.data(), size, nullptr),
        samplesCommon::InferDeleter());
    if (!mEngine)
    {
        return false;
    }
    context1 = SampleUniquePtr<nvinfer1::IExecutionContext>(mEngine->createExecutionContext());

    // Device buffers for the engine bindings
    cudaMalloc(&buffers1[0], 3 * 512 * 512 * sizeof(float));
    cudaMalloc(&buffers1[1], 150 * 512 * 512 * sizeof(float));
    // Pinned host staging buffers
    cudaMallocHost(&buffers2[0], 3 * 512 * 512 * sizeof(float));
    cudaMallocHost(&buffers2[1], 150 * 512 * 512 * sizeof(float));
    cudaStreamCreateWithFlags(&stream_, cudaStreamNonBlocking);
    //cudaMalloc(&buffers2[0], 3 * 512 * 512 * sizeof(float));
    //cudaMalloc(&buffers2[1], 150 * 512 * 512 * sizeof(float));
    //cudaStreamCreate(&stream_);
    return true;
}
```
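As an aside, none of the CUDA calls above check their return codes; a minimal checking wrapper (my own sketch, not part of the original code) would make failures such as a failed cudaMallocHost visible:

```
#include <cstdio>

// Report CUDA runtime errors instead of silently ignoring them.
#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",            \
                         cudaGetErrorString(err_), __FILE__, __LINE__); \
        }                                                               \
    } while (0)

// usage: CUDA_CHECK(cudaMalloc(&buffers1[0], 3 * 512 * 512 * sizeof(float)));
```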
```
// Function that performs inference
bool SampleInference::infer_enqueue(cv::Mat &inputs_fin)
{
    //cudaStreamCreate(&stream_); // moved to build(), which is called during init
    input_img = inputs_fin;
    bool status_processInput = processInput(input_img); // fills the pinned host buffer buffers2[0]
    cudaMemcpyAsync(buffers1[0], buffers2[0], 3 * 512 * 512 * sizeof(float), cudaMemcpyHostToDevice, stream_);
    bool status_inference = context1->enqueueV2(buffers1, stream_, nullptr);
    return status_processInput && status_inference;
}
```
```
cv::Mat SampleInference::infer_dequeue()
{
    // Copy the output back to the pinned host buffer and wait for the stream
    cudaMemcpyAsync(buffers2[1], buffers1[1], 150 * 512 * 512 * sizeof(float), cudaMemcpyDeviceToHost, stream_);
    cudaStreamSynchronize(stream_);
    //cudaStreamDestroy(stream_); // moved to a destroy function called at the end
    cv::Mat output_fin = processOutput(input_img);
    return output_fin;
}
```
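The timestamps mentioned above are host-side. To time each engine on its own stream, a CUDA-event measurement could be added (a sketch of my own, assuming each class keeps an event pair next to its stream_); if the two engines really overlapped, the wall-clock time for both would be close to the larger of the two per-stream times rather than their sum:

```
// created once in build():
cudaEvent_t start_, stop_;
cudaEventCreate(&start_);
cudaEventCreate(&stop_);

// around the enqueue in infer_enqueue():
cudaEventRecord(start_, stream_);
context1->enqueueV2(buffers1, stream_, nullptr);
cudaEventRecord(stop_, stream_);

// after cudaStreamSynchronize(stream_) in infer_dequeue():
float ms = 0.f;
cudaEventElapsedTime(&ms, start_, stop_);
printf("GPU time on this stream: %.1f ms\n", ms);
```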
Environment
TensorRT Version: 7.1.3.0
GPU Type: Jetson Nano (NVIDIA Tegra X1 (nvgpu)/integrated)
Nvidia Driver Version: L4T 32.4.4 [ JetPack 4.4.1 ]
CUDA Version: 10.2.89
CUDNN Version: 8.0.0.180
Operating System + Version: Ubuntu 18.04.6 LTS
OpenCV Version: 4.4.0
Steps To Reproduce
Call infer_enqueue for the larger engine, then infer_enqueue for the smaller engine, and then infer_dequeue for the smaller engine (a driver sketch with the observed timing is below).
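A hypothetical reproduction driver (big, small, and frame are illustrative names) with the host-side timing that shows the serialization:

```
#include <chrono>
#include <iostream>

auto t0 = std::chrono::steady_clock::now();
big.infer_enqueue(frame);              // ~3 s engine enqueued first
small.infer_enqueue(frame);            // ~100 ms engine enqueued second
cv::Mat out = small.infer_dequeue();   // expected after ~100 ms, observed after ~3100 ms
auto t1 = std::chrono::steady_clock::now();
std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
          << " ms until the small engine's output" << std::endl;
```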