OpenCV CUDA Streams do not execute in parallel

fabian.solano · May 20, 2020, 5:06pm

Hi all,

Currently I am working on a project that uses OpenCV with CUDA support to perform some operations (warpPerspective, warpAffine, resize, among others). I am trying to use Streams to execute several computations in parallel. The problem is that even when using streams, the whole process runs serialized.

OpenCV 4.1 with CUDA support
CUDA 10.2
GeForce GTX 1050 Ti Mobile (final program will also be used in Jetson platforms)
NVIDIA Nsight Systems 2020.1.1

Here is a minimal example that can be used to reproduce the problem:



#include "opencv2/opencv.hpp"
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/core/cuda.hpp>

#include <vector>
#include <memory>
#include <iostream>
#include <chrono>
#include <ctime>

cv::Mat doRotate(cv::Mat cvInput, double angle, cv::cuda::Stream stream){

    cv::Point2f center((cvInput.cols - 1) / 2.0, (cvInput.rows - 1) / 2.0);
    cv::Mat rot = cv::getRotationMatrix2D(center, angle, 1.0);

    //Determine bounding rectangle, center not relevant.
    cv::Rect2f bbox = cv::RotatedRect(cv::Point2f(), cvInput.size(),
                                      angle).boundingRect2f();

    //Adjust transformation matrix.
    rot.at<double>(0, 2) += bbox.width / 2.0 - cvInput.cols / 2.0;
    rot.at<double>(1, 2) += bbox.height / 2.0 - cvInput.rows / 2.0;

    cv::Rect2f rBox = cv::RotatedRect(cv::Point2f(), cvInput.size(),angle).boundingRect2f();

    cv::Mat output;

    cv::cuda::HostMem cudaMemSrc = cv::cuda::HostMem(cvInput, cv::cuda::HostMem::PAGE_LOCKED);
    cv::cuda::HostMem cudaMemDst = cv::cuda::HostMem(output, cv::cuda::HostMem::PAGE_LOCKED);

    cv::cuda::GpuMat src, dst;
    
    src.upload(cudaMemSrc,stream);
    cv::cuda::warpAffine(src, dst, rot, rBox.size(),0,0,cv::Scalar(),stream);


    dst.download(output,stream);

    return output;
}

std::shared_ptr<std::vector<cv::Mat>> computeArray(std::shared_ptr<std::vector<cv::Mat>> inputArray,
                                                   std::shared_ptr<std::vector<cv::cuda::Stream>> streamsArray){


    std::vector<cv::Mat> *array = inputArray.get();

    std::shared_ptr<std::vector<cv::Mat>> outputArray = std::make_shared< std::vector<cv::Mat> >();

    //#pragma omp parallel for
    for(unsigned int i=0; i<array->size(); i++){

        cv::Mat result = doRotate((*array)[i],180,(*streamsArray)[i]);

        outputArray->push_back(result);
    }

    (*streamsArray)[0].waitForCompletion();
    (*streamsArray)[1].waitForCompletion();
    (*streamsArray)[2].waitForCompletion();
    (*streamsArray)[3].waitForCompletion();

    return outputArray;

}

int main (int argc, char* argv[]){

    std::shared_ptr<std::vector<cv::cuda::Stream>> streamsArray = std::make_shared<std::vector<cv::cuda::Stream>>();

    cv::cuda::Stream streamA, streamB, streamC, streamD;
    streamsArray->push_back(streamA);
    streamsArray->push_back(streamB);
    streamsArray->push_back(streamC);
    streamsArray->push_back(streamD);


    for(int i=0; i<20; i++){
        try{
            cv::Mat src_host = cv::imread("1080.jpg");

            std::shared_ptr<std::vector<cv::Mat>> images = std::make_shared<std::vector<cv::Mat>>();

            images->push_back(src_host);
            images->push_back(src_host);
            images->push_back(src_host);
            images->push_back(src_host);

            std::shared_ptr<std::vector<cv::Mat>> result = std::make_shared<std::vector<cv::Mat>>();

            result = computeArray(images, streamsArray);

            //cv::imshow("Result", (*result)[0]);
            //cv::waitKey(0);
        }
        catch(const cv::Exception& ex){
            std::cout << "Error: " << ex.what() << std::endl;
        }
    }

    return 0;
}

As seen in the previous image, the 4 streams are detected by Nsight, but they still execute serially.

Also if the following line is uncommented:

#pragma omp parallel for

The result is different and the HtoD operations seem to happen first, but the overall execution is still serialized.

Things I have tried so far:

1- Compile OpenCV with the -DCUDA_NVCC_FLAGS="--default-stream per-thread;" option
2- Use pinned vs shared memory
3- Use global allocated image (to avoid allocating memory each iteration)

Any help or hint will be appreciated.

Regards,
Fabian

fabian.solano · May 27, 2020, 6:43pm

After a lot of debugging I was able to find the problem. It is related to the memory allocation process: everything must be reserved before each iteration.

You can find full working example here: OpenCV CUDA Streams | OpenCV CUDA | RidgeRun Developer

The result is a pipelined execution where the kernel and MemcpyDtoH work in parallel as expected, without gaps between the executions of each kernel.

Regards,
Fabian
www.ridgerun.com
