Low performance – Patch Match. Image Processing on GPU (CUDA)

Hello,

I have a performance problem: CPU and GPU performance are almost the same.

The problem I'm dealing with is patch match. I have two matrices, and I want to find where the maximum similarity between the big matrix and the small one occurs.

The matrices have binary values 0/1 (black and white).

When I check the small matrix against the big one on an i5 CPU, it takes 30 ms (using multithreading).

When I run the same check on a GeForce GT 730, it takes 33 ms.

I would expect the GPU to be at least an order of magnitude faster, so I'm pretty disappointed with my current results.

I have two matrices:

  1. Big: 300,000 pixels (300 rows, 1000 columns)

  2. Small: 50,000 pixels (50 rows, 1000 columns)

The comparison is done by dividing the big matrix into 250 sub-matrices (each a 50 x 1000 window of the big matrix at a different row offset) and comparing each of them to the small matrix, then taking the highest similarity.

The similarity criterion is the number of pixels that are black in both matrices (the small one and the big sub-matrix), divided by the number of black pixels in the big sub-matrix.
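In other words, what the code below computes (including its sumDots > 500 threshold) is:

similarity = 100 * (number of indices i where SUB[i] = 0 and SMALL[i] = 0) / (number of indices i where SUB[i] = 0)

and the score is only stored when there is at least one match and the denominator is greater than 500.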

I implemented this using the following CUDA code:

__global__ void matCompare_cuda (uint8_t *D_SUB, uint8_t *D_SMALL, float *D_RSLTS, unsigned int step, int numOfIndentations, int SUB_size, int SMALL_size)
{
    int i = 0, j = 0, success = 0, sumDots = 0;
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    // Each thread scores one indentation (one sub-matrix of the big image).
    int LoopIndex = tid * step;

    if (tid < numOfIndentations)
    {
        // Walk over every pixel of the small matrix and the matching
        // pixel of this thread's sub-matrix.
        for (j = 0; j < SMALL_size; j++)
        {
            i = j + LoopIndex;
            if (D_SUB[i] == 0)           // black pixel in the sub-matrix
            {
                sumDots++;
                if (D_SMALL[j] == 0)     // black in the small matrix too
                    success++;
            }
        }
        if (success > 0 && sumDots > 500)
            D_RSLTS[tid] = 100 * ((float)success / sumDots);
    }
}

The kernel launch:

int numOfIndentations = 300 - 50;   // (big rows) - (small rows) = 250

int numBlock = 16;
int threadNumber = numOfIndentations / numBlock;   // 250 / 16 = 15 (integer division)

matCompare_cuda<<< numBlock, threadNumber >>>(D_SUB, D_SMALL, D_RSLTS, step, numOfIndentations, SUB_size, SMALL_size);

The CPU code:

for (i = 0; i < pixelNum; i++)
{
    if (SUB[i] == 0)                // black pixel in the sub-matrix
    {
        sumDots = sumDots + 1;
        if (SMALL->Image[i] == 0)   // black in the small matrix too
        {
            success = success + 1;
        }
    }
}

if (success > 0 && sumDots > 500)
    RSLT = ((float)success / sumDots) * 100;

Do you see any improvements that can be made to the GPU code?

(300 - 50) / 16 = 15 threads per block (with integer division)

Launching a kernel of 16 blocks with 15 threads per block is far too small to take advantage of any GPU to its full extent.

You should aim for a problem size or organization that will allow you to launch 100 or more blocks with 128 threads per block or more, in multiples of 32.
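For example, with this problem size one way to get there is to assign one block per indentation and let the threads of that block cooperate on the 50,000-pixel sums with a shared-memory reduction. Here is a minimal, untested sketch that assumes the same D_SUB / D_SMALL layout and the same meaning of step as in your code; BLOCK_SIZE = 256 is an arbitrary (power-of-two) choice, not something taken from your code:

#include <stdint.h>

#define BLOCK_SIZE 256   // hypothetical block size; must be a power of two for the reduction below

__global__ void matCompare_reduce (const uint8_t *D_SUB, const uint8_t *D_SMALL,
                                   float *D_RSLTS, unsigned int step,
                                   int numOfIndentations, int SMALL_size)
{
    __shared__ int s_success[BLOCK_SIZE];
    __shared__ int s_sumDots[BLOCK_SIZE];

    int indent = blockIdx.x;                 // one indentation per block
    if (indent >= numOfIndentations)
        return;                              // the whole block exits together

    int base = indent * step;
    int success = 0, sumDots = 0;

    // The threads of the block stride over the pixels of the small matrix,
    // so consecutive threads read consecutive elements (coalesced access).
    for (int j = threadIdx.x; j < SMALL_size; j += blockDim.x)
    {
        if (D_SUB[base + j] == 0)            // black in the sub-matrix
        {
            sumDots++;
            if (D_SMALL[j] == 0)             // black in the small matrix too
                success++;
        }
    }

    s_success[threadIdx.x] = success;
    s_sumDots[threadIdx.x] = sumDots;
    __syncthreads();

    // Standard shared-memory tree reduction of the per-thread counts.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        if (threadIdx.x < stride)
        {
            s_success[threadIdx.x] += s_success[threadIdx.x + stride];
            s_sumDots[threadIdx.x] += s_sumDots[threadIdx.x + stride];
        }
        __syncthreads();
    }

    if (threadIdx.x == 0)
    {
        D_RSLTS[indent] = 0.0f;
        if (s_success[0] > 0 && s_sumDots[0] > 500)
            D_RSLTS[indent] = 100.0f * ((float)s_success[0] / s_sumDots[0]);
    }
}

Launched as matCompare_reduce<<< numOfIndentations, BLOCK_SIZE >>>(D_SUB, D_SMALL, D_RSLTS, step, numOfIndentations, SMALL_size), this gives 250 blocks of 256 threads (64,000 threads in flight) instead of 16 blocks of 15 threads, which is much closer to what a GPU needs to stay busy.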