Thrust::minmax_element slower than host implementation with OpenCV

I have an algorithm that requires me to compute the minimum and maximum multiple times over an image. The image is stored in a one-channel cv::cuda::GpuMat. The image is divided into a grid, which in this particular case gives 64 blocks (an 8x8 grid). My host implementation takes about 25 ms to produce the min and max for every block. The code below is the host implementation; findMinMax is called once for each block and the results are stored in another array.

typedef struct min_max_t {
    uchar min;
    uchar max;
} min_max_t;

min_max_t findMinMax(cv::Mat block)
{
    min_max_t min_max;
    min_max.min = 255;
    min_max.max = 0;

    for(int i = 0; i < block.rows; ++i) {
        uchar *value_ptr = block.ptr<uchar>(i);

        for(int j = 0; j < block.cols; ++j) {
            if(value_ptr[j] < THRESH_DARK_PIXELS)
                continue;

            if(value_ptr[j] < min_max.min)
                min_max.min = value_ptr[j];
            if(value_ptr[j] > min_max.max)
                min_max.max = value_ptr[j];
        }
    }

    if(min_max.min >= min_max.max)
        min_max.min = min_max.max;

    return min_max;
}

It’s straightforward. I was hoping to use the Thrust API, but a GpuMat doesn’t translate to it too easily. I used this post to get an iterator over the block, and made a typedef gpu_mat_itr_t for those iterators. The code is as follows:

struct isNotDark
{
    __device__
    bool operator()(const uchar p)
    {
        return (p >= THRESH_DARK_PIXELS);
    }
};

min_max_t findMinMax(cv::cuda::GpuMat block)
{
    cv::cuda::GpuMat block_thresh;
    gpu_mat_itr_t<uchar> block_itr_begin, block_itr_end, block_thresh_itr_begin, block_thresh_itr_end;
    thrust::pair< gpu_mat_itr_t<uchar>, gpu_mat_itr_t<uchar> > thrust_min_max_result;
    min_max_t min_max;

    min_max.min = 255;
    min_max.max = 0;

    block_itr_begin = GpuMatBeginItr<uchar>(block);
    block_itr_end = GpuMatEndItr<uchar>(block);

    block_thresh = cv::cuda::GpuMat(block.size(), block.type());
    block_thresh_itr_begin = GpuMatBeginItr<uchar>(block_thresh);

    block_thresh_itr_end = thrust::copy_if(block_itr_begin, block_itr_end, block_thresh_itr_begin, isNotDark());
    thrust_min_max_result = thrust::minmax_element(block_thresh_itr_begin, block_thresh_itr_end);

    min_max.min = *(thrust_min_max_result.first);
    min_max.max = *(thrust_min_max_result.second);

    if(min_max.min >= min_max.max)
        min_max.min = min_max.max;

    return min_max;
}

This code takes 130-150ms to produce min_max for all 64 blocks. I figured it might be the fact that I’m copying the data to another array so I removed it to see if performance might improve. I know this wouldn’t produce results I want but the performance only improved slightly to around 100ms. Is this because I’m using that iterator? Is the performance bad because it’s not explicitly stored in a thrust::device_vector? Is there any way to do this so that I can at least get something similar to the performance of my host implementation at about 25ms?

I’m currently trying to write my own kernel implementation, but I’m struggling to wrap my head around reduction solutions. I figure the best approach would be to do a reduction for each row and then reduce all the per-row results together to get the min/max for a block. But this has been very difficult for me to understand and implement.
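For what it’s worth, the row-then-block idea can be written out on the host first to pin down the logic. A plain C++ sketch of the two-stage shape (my own names, no CUDA; a real kernel would run stage 1 with one thread block per row and stage 2 over the per-row results):

```cpp
#include <algorithm>
#include <climits>
#include <vector>

struct MinMax { int mn; int mx; };

// Stage 1: reduce each row to a single MinMax pair.
// Stage 2: reduce the per-row pairs to one pair for the whole block.
MinMax blockMinMax(const std::vector<std::vector<int>>& block) {
    std::vector<MinMax> per_row;
    for (const auto& row : block) {              // stage 1: per-row reduction
        MinMax r{INT_MAX, INT_MIN};
        for (int v : row) {
            r.mn = std::min(r.mn, v);
            r.mx = std::max(r.mx, v);
        }
        per_row.push_back(r);
    }
    MinMax out{INT_MAX, INT_MIN};
    for (const auto& r : per_row) {              // stage 2: combine rows
        out.mn = std::min(out.mn, r.mn);
        out.mx = std::max(out.mx, r.mx);
    }
    return out;
}
```

The same two-stage structure carries over to a kernel; only the per-stage parallelization changes.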

I’m on the Jetson Nano if that’s important.

  1. What are the image dimensions for the performance numbers you have reported?
  2. I assume your host performance number is also for host code running on jetson nano. Is that true?
  3. I assume you are calling your thrust version of findMinMax in a loop, iterating over the 64 blocks. Is that true?
  4. I assume you are not compiling with the -G switch. Is that true?
  1. 1600x1200 so each block is 200x150
  2. Yes. Host implementation is on jetson nano too. In my actual processing, I have more than 10x speed up so that’s great. Just this min max calculation is causing problems. I timed the host implementation with std::chrono::high_resolution_clock.
  3. Yes. The implementation looks like this:
min_max_matrix_t makeMinMaxMatrix(cv::cuda::GpuMat channel, int *block_rows, int *block_cols)
{
    min_max_matrix_t min_max_matrix;

    //allocate memory for min_max_matrix

    //8x8 grid
    for(int y = 0; y < 8; ++y) {
        for(int x = 0; x < 8; ++x) {
            cv::Rect roi(x * (*block_cols), y * (*block_rows), *block_cols, *block_rows);
            cv::cuda::GpuMat block = cv::cuda::GpuMat(channel, roi);

            min_max_matrix[y][x] = findMinMax(block);
        }
    }
    return min_max_matrix;
}

block_cols and block_rows are the dimensions of each grid block, which are 200 and 150 respectively. The host implementation is the same but with a different findMinMax function and using cv::Mat.
4. Yes I am not compiling with -G.

When working on the GPU, we’d like to:

  • not do unnecessary data copying
  • avoid things like allocations in performance loops

With that in mind, let’s set out to do 2 things:

  1. Work directly from your starting GpuMat, not make block copies
  2. Get the allocations out of the inner function

I’d suggest something like this (coded in browser, not tested):

struct idx_convert
{
    int blocks_x, block_cols, block_rows, x, y;
    idx_convert(int _x, int _y, int _blocks_x, int _block_cols, int _block_rows)
        : x(_x), y(_y), blocks_x(_blocks_x), block_cols(_block_cols), block_rows(_block_rows) {}
    __host__ __device__
    int operator()(int idx) {
        int row = idx / block_cols + y*block_rows;  // row in the full image
        int col = idx % block_cols + x*block_cols;  // column in the full image
        return row*block_cols*blocks_x + col;       // linear index into the full image
    }
};
min_max_t findMinMax(cv::cuda::GpuMat &image, thrust::device_vector<unsigned char> &block_thresh, int x, int y, int blocks_x, int block_cols, int blocks_y, int block_rows)
{
    thrust::device_vector<uchar>::iterator block_thresh_itr_begin, block_thresh_itr_end;
    thrust::pair< thrust::device_vector<uchar>::iterator, thrust::device_vector<uchar>::iterator > thrust_min_max_result;
    min_max_t min_max;

    min_max.min = 255;
    min_max.max = 0;

    auto block_itr_begin = thrust::make_permutation_iterator(
        GpuMatBeginItr<uchar>(image),
        thrust::make_transform_iterator(thrust::counting_iterator<int>(0),
                                        idx_convert(x, y, blocks_x, block_cols, block_rows)));
    auto block_itr_end = block_itr_begin + block_cols*block_rows;


    block_thresh_itr_begin = block_thresh.begin();

    block_thresh_itr_end = thrust::copy_if(block_itr_begin, block_itr_end, block_thresh_itr_begin, isNotDark());
    thrust_min_max_result = thrust::minmax_element(block_thresh_itr_begin, block_thresh_itr_end);

    min_max.min = *(thrust_min_max_result.first);
    min_max.max = *(thrust_min_max_result.second);

    if(min_max.min >= min_max.max)
        min_max.min = min_max.max;

    return min_max;
}

min_max_matrix_t makeMinMaxMatrix(cv::cuda::GpuMat &channel, int *block_rows, int *block_cols)
{
    min_max_matrix_t min_max_matrix;

    //allocate memory for min_max_matrix
    thrust::device_vector<unsigned char> temp((*block_rows)*(*block_cols));
    //8x8 grid
    const int blocks_x = 8;
    const int blocks_y = 8;
    for(int y = 0; y < blocks_y; ++y) {
        for(int x = 0; x < blocks_x; ++x) {
            min_max_matrix[y][x] = findMinMax(channel, temp, x, y, blocks_x, *block_cols, blocks_y, *block_rows );
        }
    }
    return min_max_matrix;
}
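As a sanity check on the idx_convert arithmetic, here is an equivalent plain C++ mapping (my own helper name) that can be exercised on the host, assuming a row-major image that is blocks_x * block_cols pixels wide:

```cpp
// Maps a block-local linear index to a linear index into the full
// image, for the block at grid position (x, y). Assumes the image is
// row-major with blocks_x * block_cols pixels per row.
int blockIdxToImageIdx(int idx, int x, int y,
                       int blocks_x, int block_cols, int block_rows) {
    int row = idx / block_cols;        // row within the block
    int col = idx % block_cols;        // column within the block
    int image_row = row + y * block_rows;
    int image_col = col + x * block_cols;
    return image_row * (block_cols * blocks_x) + image_col;
}
```

With a 2x2 grid of 3x2 blocks, local index 0 of block (0,0) maps to image index 0, and local index 4 of block (1,1) maps to 22, which is what you’d expect for a 6-pixel-wide image.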

I don’t think producing the block cv::Mat copies data. I may be wrong when it comes to cv::cuda::GpuMat, because there’s no description in its documentation, but under the cv::Mat notes it does say for parameter m that no data is copied when taking the ROI. To my understanding, this github issue shows how the ROI is implemented: it just sets the starting point and other parameters differently, so through the abstraction it looks like a separate copy but it’s the same data. I could be wrong about cv::cuda::GpuMat since there are no such notes, but I don’t think so; if the behavior were different, I feel like they would definitely have mentioned it in the documentation.

EDIT: From their source code for cv::cuda::GpuMat, it definitely doesn’t look like there is any copying. It looks similar to the source for cv::Mat.

Also, as I look at your code: I did hit an error when I tried to use a thrust::device_vector with the GpuMatItr. The error was hard to parse, so I just stuck with using a GpuMatItr for the copy. Is this set of code what you were referring to when you said

OK I didn’t study the GpuMat class so it all may be a waste of time. I did run something like your code on the smallest GPU I have (GT640) and it ran in ~10ms, not 130ms. You might want to build a simple test case, and profile it, to confirm first where the time is being spent.

Okay, so with some profiling of individual stuff, I found that allocating block_thresh with cv::cuda::GpuMat for each block was causing significant delay of about 50ms. I moved it to a parameter and that improved it. I changed the block_thresh to device_vector like you did but I’m not sure if that improved much. The performance bottleneck is still in copy_if and minmax_element. copy_if typically takes 20-30ms and minmax_element typically takes 30-40ms. These values are the aggregate so copy_if is 20-30 ms for all 64 blocks. Overall performance is now around 65-75ms.

The modified code is below. block_thresh is allocated once in the calling function and passed to the function like you did.

min_max_t findMinMax(cv::cuda::GpuMat block, thrust::device_vector<uchar> &block_thresh)
{
    gpu_mat_itr_t<uchar> block_itr_begin, block_itr_end;
    thrust::device_vector<uchar>::iterator block_thresh_itr_begin, block_thresh_itr_end;
    thrust::pair< thrust::device_vector<uchar>::iterator, thrust::device_vector<uchar>::iterator > thrust_min_max_result;
    min_max_t min_max;

    min_max.min = 255;
    min_max.max = 0;

    block_itr_begin = GpuMatBeginItr<uchar>(block);
    block_itr_end = GpuMatEndItr<uchar>(block);

    //timing for copy_if surrounds these two statements and added to a global value
    block_thresh_itr_begin = block_thresh.begin();
    block_thresh_itr_end = thrust::copy_if(block_itr_begin, block_itr_end, block_thresh_itr_begin, isNotDark());

    //timing for minmax_element surround this statement and added to a global value
    thrust_min_max_result = thrust::minmax_element(block_thresh_itr_begin, block_thresh_itr_end);

    min_max.min = *(thrust_min_max_result.first);
    min_max.max = *(thrust_min_max_result.second);

    if(min_max.min >= min_max.max)
        min_max.min = min_max.max;

    return min_max;
}

I have never used one of NVIDIA’s embedded platforms, so I don’t have a good feeling for what kind of performance to expect. Which is the reason I usually direct questions regarding an embedded platform immediately to the dedicated subforums, because that is where the experts for those can be found: https://forums.developer.nvidia.com/c/agx-autonomous-machines/jetson-embedded-systems/70

Robert Crovella stated that he observed execution time on the order of 10 milliseconds on a GT 640, which has 384 CUDA cores, whereas the Jetson Nano has only 128 CUDA cores. The Jetson Nano is apparently clocked at 640 MHz, whereas the GT640 has a base clock of 900 MHz, and likely can boost up to 1200 MHz or so. So overall there could be a performance difference of 5-6x in terms of computational throughput, which would be roughly in line with the run time you observed.

Note that I have not looked at memory throughput aspects, which may have a significant impact on this use case. My understanding is that in Jetson products, the CPU and GPU share the same physical memory, and that this is low-throughput memory. In this case, it seems that the GPU code is traversing memory twice, but the CPU code is traversing it only once. This would cause the GPU code to take roughly twice as long as the CPU code assuming both are limited by memory throughput. Note that I have not studied this thread closely so my assessment may be off. You may want to take closer look with the profiler to establish the bottleneck(s) in the code.

My GT640 (in deviceQuery) lists a max GPU boost clock of 1.05GHz. The bandwidth number reported by bandwidthTest is 33.4GB/s. Jetson Nano is certainly lower since I believe the max memory bandwidth (peak theoretical) is ~25GB/s

My guess is that this code is generally mem bw bound.

However since Jetson has a unified memory resource, I’m hard pressed to come up with an explanation why a CUDA kernel cannot do at least as well as the same work being done in host code especially for a mem bw bound code. If you run an algorithm in host code that takes 25ms, I don’t know of any reason why you should not be able to achieve a similar result with a well written device code implementation. So I suspect there are still gains to be made here.

The Jetson Nano has a single SM, so it only requires 2048 threads (LoL) to saturate the GPU. The block size here is 200x150 = 30,000, so the thrust ops should be launching ~30,000 threads, so that base is covered. There are only 64 blocks so the cumulative kernel launch overhead should not be a big part of the ~60ms, I don’t think. The GpuMat thrust iterator construct does do at least an integer divide per thread, so if compute were the issue we could probably do better than that by dispensing with thrust and using well-crafted 2D algorithms. But this seems unlikely to me to cause such a big difference.
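For context, that per-thread integer divide comes from mapping a linear element index onto a pitched (row-padded) allocation. A host-side sketch of the arithmetic involved (my own illustration, not the actual iterator code):

```cpp
#include <cstddef>

// A GpuMat pads each row out to `pitch_bytes`, which may exceed
// cols * elemSize. Turning a linear element index into a byte offset
// therefore costs one divide and one modulo per element (uchar data here).
size_t linearIdxToByteOffset(int idx, int cols, size_t pitch_bytes) {
    int row = idx / cols;   // the per-thread integer divide
    int col = idx % cols;
    return static_cast<size_t>(row) * pitch_bytes + col;
}
```

A hand-written 2D kernel avoids this by indexing rows and columns directly from the thread coordinates.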

Unfortunately I don’t have a convenient setup at the moment that has the CUDA version of OpenCV installed, so I’m not able to run this code directly/exactly myself and inspect.

If you have a Jetson Nano, you can use my installation. This is OpenCV 4.4 built with mdegans’ script plus some extra flags; with the package_opencv flag, the build gave me a self-extracting script containing a tar file. Here’s the script to download. (I haven’t actually tested this on a fresh Nano, so I hope it works.) I just install it with sudo ./OpenCV-*-aarch64.sh --prefix=/usr/local/ --exclude-subdir.

I was thinking about this more today and I thought maybe I could at least improve performance by doing the copy_if and minmax_element asynchronously. Something like:

stream1: block1 copy_if | block2 copy_if | block3 copy_if | ...
stream2:                  block1 minmax  | block2 minmax  | block3 minmax  | ...

I remember seeing something like this in one of the CUDA presentations, but I’m not sure this is a situation where it will help. Do you think this would improve performance? I’m also not sure how to implement it: can I just pass a stream through the thrust::cuda::par execution policy argument? And I’m not sure how to synchronize the streams like that.

That should definitely be the case, but that is not what we have when using the two Thrust calls above, which would appear to make two passes (one each for copy_if, minmax_element) over the data. Am I missing something?
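For illustration, the two passes can in principle be fused into a single traversal that applies the darkness threshold and folds min/max in one go, with no intermediate copy. A host-side C++ sketch of that fused reduction (the predicate and the 255/0 sentinels mirror the code above; on the GPU this would correspond to one custom kernel or a single thrust::transform_reduce, which I have not tested here):

```cpp
#include <vector>

struct min_max_t { unsigned char min; unsigned char max; };

// One pass over the block: skip dark pixels and fold the rest into
// the running min/max. This replaces copy_if + minmax_element, so the
// data is traversed once instead of twice.
min_max_t fusedMinMax(const std::vector<unsigned char>& block,
                      unsigned char thresh_dark_pixels) {
    min_max_t mm{255, 0};
    for (unsigned char p : block) {
        if (p < thresh_dark_pixels) continue;   // the isNotDark predicate
        if (p < mm.min) mm.min = p;
        if (p > mm.max) mm.max = p;
    }
    if (mm.min >= mm.max)                       // all pixels dark: collapse
        mm.min = mm.max;
    return mm;
}
```

This matches the original host loop, which is exactly why the host code only needs one traversal.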