[SOLVED] Finding the maximum values with CUDA

Hello everyone, I want to make a question about the efficience of an algorithm that I implemented in the Jetson TK1 board. I’m very newbie in this so please be nice with me :).
I have a grey image that i divided in blocks . Inside of this blocks I must find the maximum value. I must do that extremely fast so I created one GPU block per block in the image in wich I need to find the maximum value and one thread per block in the GPU. My kernel is this.

__global__ void  Max(uchar image[],uchar result[],int width,int side)
{
    int col   = blockIdx.x*blockDim.x+threadIdx.x;
    int rows   = blockIdx.y*blockDim.y+threadIdx.y;
    int index = col + rows*width;
    int max=0;

    if(threadIdx.x==0 and threadIdx.y==0)
    {
        for(int i=rows;i<rows+side;i++)
        {
            for(int j=col+i*width;j<col+(i*width)+side;j++)
            {
                if(image[j]>max)
                {
                    max=image[j];
                    index=j;
                }
            }
        }

        result[index]=0;
    }
    else
        result[index]=image[index];
}

I declare the blocks and the threads

dim3 block(32,32);
dim3 thread(32,32);

I copy the image to the GPU and the result matrix

And I call the kernel in this way

Max<<<block,thread>>>(cuda_image,cuda_result,1024,32);

The code works but really I don’t know if I am doing well . I guess that my solution is slow and I think that I’m doing something wrong but I don’t know what.

Do I doing well?

Thanks!

Yes, this is bad. You are only using 1 thread per block. You should investigate parallel reduction techniques.

Here’s an article to get you started.

https://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/

What you need is a parallel reduction operation over a 32x32 sized thread block (i.e. 1024 threads) using max() as the arithmetic operator.

note that most examples you find online use addition to demonstrate the concept of parallel reduction. However the concept works with any commutative binary operator (addition, multiplication, min, max, …)

Christian

Thanks txbob and cbuchner1 , I’m going to read the article.
Thanks!

Ok. I did it hehe (a week ago XD) . Yeah i needed doing a reduction in a matrix instead of a linear array like the paper of cbuchner1. I help to me too much think the matrix like a graph paper where each square is a pixel and one reduction is like folding to the half the paper and do the operation with the squares that are in front , and so on until finish with the rows (and left one column) and the do the same with the column (but folding vertically) until the result of the reduction.
I’m very happy!