Low performance – Patch Match. Image Processing on GPU (CUDA)

Hello,

I have a performance problem: CPU and GPU performance are almost the same.

The problem I'm dealing with is patch match. I have two matrices, and I want to find where the maximum similarity between the big matrix and the small one occurs.

The matrices have binary values 0/1 (black and white).

When I check the small matrix against the big one on an i5 CPU, it takes 30 ms (using multithreading).

When I run the same check on a GeForce GT 730, it takes 33 ms.

I would expect the GPU to be at least an order of magnitude faster, so I'm pretty disappointed with my current results.

I have two matrices:

  1. Big: 300,000 pixels (300 rows, 1000 columns)

  2. Small: 50,000 pixels (50 rows, 1000 columns)

The comparison is done by dividing the big matrix into 250 sub-matrices (each a 50 x 1000 window of the big matrix at a different row offset) and comparing each of them to the small matrix, then taking the highest similarity.

The similarity criterion is the number of pixels that are black in both matrices (the small one and the big sub-matrix), divided by the number of black pixels in the big sub-matrix.
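In other words, what the code below computes (including its sumDots > 500 threshold) is:

similarity = 100 * (number of indices i where SUB[i] = 0 and SMALL[i] = 0) / (number of indices i where SUB[i] = 0)

and the score is only stored when there is at least one match and the denominator is greater than 500.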

I implemented this using the following CUDA code:

__global__ void matCompare_cuda (uint8_t *D_SUB, uint8_t *D_SMALL, float *D_RSLTS, unsigned int step, int numOfIndentations, int SUB_size, int SMALL_size)
{
    int i = 0, j = 0, success = 0, sumDots = 0;
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    // Each thread scores one indentation (one sub-matrix of the big image).
    int LoopIndex = tid * step;

    if (tid < numOfIndentations)
    {
        // Walk over every pixel of the small matrix and the matching
        // pixel of this thread's sub-matrix.
        for (j = 0; j < SMALL_size; j++)
        {
            i = j + LoopIndex;
            if (D_SUB[i] == 0)           // black pixel in the sub-matrix
            {
                sumDots++;
                if (D_SMALL[j] == 0)     // black in the small matrix too
                    success++;
            }
        }
        if (success > 0 && sumDots > 500)
            D_RSLTS[tid] = 100 * ((float)success / sumDots);
    }
}

The kernel launch:

int numOfIndentations = 300 - 50;   // (big rows) - (small rows) = 250

int numBlock = 16;
int threadNumber = numOfIndentations / numBlock;   // 250 / 16 = 15 (integer division)

matCompare_cuda<<< numBlock, threadNumber >>>(D_SUB, D_SMALL, D_RSLTS, step, numOfIndentations, SUB_size, SMALL_size);

The CPU code:

for (i = 0; i < pixelNum; i++)
{
    if (SUB[i] == 0)                // black pixel in the sub-matrix
    {
        sumDots = sumDots + 1;
        if (SMALL->Image[i] == 0)   // black in the small matrix too
        {
            success = success + 1;
        }
    }
}

if (success > 0 && sumDots > 500)
    RSLT = ((float)success / sumDots) * 100;

Do you see any improvements that can be made to the GPU code?

(300 - 50) / 16 = 15 threads per block (with integer division)

Launching a kernel of 16 blocks with 15 threads per block is far too small to take advantage of any GPU to its full extent.

You should aim for a problem size or organization that will allow you to launch 100 or more blocks with 128 threads per block or more, in multiples of 32.
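For example, with this problem size one way to get there is to assign one block per indentation and let the threads of that block cooperate on the 50,000-pixel sums with a shared-memory reduction. Here is a minimal, untested sketch that assumes the same D_SUB / D_SMALL layout and the same meaning of step as in your code; BLOCK_SIZE = 256 is an arbitrary (power-of-two) choice, not something taken from your code:

#include <stdint.h>

#define BLOCK_SIZE 256   // hypothetical block size; must be a power of two for the reduction below

__global__ void matCompare_reduce (const uint8_t *D_SUB, const uint8_t *D_SMALL,
                                   float *D_RSLTS, unsigned int step,
                                   int numOfIndentations, int SMALL_size)
{
    __shared__ int s_success[BLOCK_SIZE];
    __shared__ int s_sumDots[BLOCK_SIZE];

    int indent = blockIdx.x;                 // one indentation per block
    if (indent >= numOfIndentations)
        return;                              // the whole block exits together

    int base = indent * step;
    int success = 0, sumDots = 0;

    // The threads of the block stride over the pixels of the small matrix,
    // so consecutive threads read consecutive elements (coalesced access).
    for (int j = threadIdx.x; j < SMALL_size; j += blockDim.x)
    {
        if (D_SUB[base + j] == 0)            // black in the sub-matrix
        {
            sumDots++;
            if (D_SMALL[j] == 0)             // black in the small matrix too
                success++;
        }
    }

    s_success[threadIdx.x] = success;
    s_sumDots[threadIdx.x] = sumDots;
    __syncthreads();

    // Standard shared-memory tree reduction of the per-thread counts.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        if (threadIdx.x < stride)
        {
            s_success[threadIdx.x] += s_success[threadIdx.x + stride];
            s_sumDots[threadIdx.x] += s_sumDots[threadIdx.x + stride];
        }
        __syncthreads();
    }

    if (threadIdx.x == 0)
    {
        D_RSLTS[indent] = 0.0f;
        if (s_success[0] > 0 && s_sumDots[0] > 500)
            D_RSLTS[indent] = 100.0f * ((float)s_success[0] / s_sumDots[0]);
    }
}

Launched as matCompare_reduce<<< numOfIndentations, BLOCK_SIZE >>>(D_SUB, D_SMALL, D_RSLTS, step, numOfIndentations, SMALL_size), this gives 250 blocks of 256 threads (64,000 threads in flight) instead of 16 blocks of 15 threads, which is much closer to what a GPU needs to stay busy.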