Hello,
I have a performance problem: CPU and GPU performance are almost the same.
The problem I am dealing with is patch matching. I have two matrices, and I want to find where the similarity between the big matrix and the small one is highest.
The matrices hold binary values, 0/1 (black and white).
When I match the small matrix against the big one on an i5 CPU, it takes 30 ms (using multithreading).
When I run the same match on a GeForce GT 730, it takes 33 ms.
I would expect the GPU to be faster by at least one order of magnitude. I am pretty disappointed with my current results.
I have two matrices:

Big: 300,000 elements (300 rows, 1000 columns)

Small: 50,000 elements (50 rows, 1000 columns)
The comparison is done by dividing the big matrix into 250 sub-matrices, comparing each one to the small matrix, and then finding the highest similarity.
The similarity criterion is the number of pixels that are black in both matrices (the small one and the sub-matrix of the big one) divided by the number of black pixels in the sub-matrix.
I implemented this last step with the following CUDA code:
__global__ void matCompare_cuda(uint8_t *D_SUB, uint8_t *D_SMALL, float *D_RSLTS, unsigned int step, int numOfIndentations, int SUB_size, int SMALL_size)
{
    int i = 0, j = 0, success = 0, sumDots = 0;
    int tid = threadIdx.x + blockIdx.x * blockDim.x;  // one thread per sub-matrix
    int LoopIndex = tid * step;                       // start of this thread's sub-matrix in D_SUB
    if (tid < numOfIndentations)
    {
        for (j = 0; j < SMALL_size; j++)
        {
            i = j + LoopIndex;
            if (D_SUB[i] == 0)          // black pixel in the sub-matrix
            {
                sumDots++;
                if (D_SMALL[j] == 0)    // also black in the small matrix
                    success++;
            }
        }
        // Similarity in percent; D_RSLTS[tid] is left untouched otherwise
        if (success > 0 && sumDots > 500)
            D_RSLTS[tid] = 100 * ((float)success / sumDots);
    }
}
The kernel launch:
int numOfIndentations = 300 - 50; // (big.row) - (small.row)
int numBlock = 16;
int threadNumber = numOfIndentations/numBlock;
matCompare_cuda<<< numBlock , threadNumber >>> ( D_SUB , D_SMALL , D_RSLTS , step, numOfIndentations, SUB_size, SMALL_size );
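One thing that may be worth checking in this launch is the integer division. Assuming numOfIndentations is 250 (the 250 sub-matrices mentioned above), 250/16 truncates to 15 threads per block, which gives only 240 threads in total. Below is a minimal sketch in plain C of that arithmetic next to a rounded-up alternative. The helper names `ceil_div` and `total_threads` and the block size of 64 are illustrative assumptions, not part of the original code.

```c
/* Round-up integer division: smallest q with q * d >= n (for n >= 0, d > 0).
   Hypothetical helper, not taken from the original code. */
static int ceil_div(int n, int d)
{
    return (n + d - 1) / d;
}

/* Total number of threads provided by a launch of `blocks` blocks with
   `threads_per_block` threads each. Hypothetical helper. */
static int total_threads(int blocks, int threads_per_block)
{
    return blocks * threads_per_block;
}
```

With these helpers, total_threads(16, 250 / 16) is 240, while total_threads(ceil_div(250, 64), 64) is 256. In the second case the `tid < numOfIndentations` check already present in the kernel would mask the 6 surplus threads.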
The CPU code:
for (i = 0; i < pixelNum; i++)
{
    if (SUB[i] == 0)                  // black pixel in the sub-matrix
    {
        sumDots = sumDots + 1;
        if (SMALL->Image[i] == 0)     // also black in the small matrix
        {
            success = success + 1;
        }
    }
}
if (success > 0)
    if (sumDots > 500)
        RSLT = ((float)success / sumDots) * 100;
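For reference, the whole search described earlier (one similarity value per sub-matrix, then the maximum) could be sketched on the host roughly as follows. This is only a sketch under assumptions: the function names `window_similarity` and `best_offset` are made up for illustration, 0 is treated as black as in the snippets above, a window that fails the thresholds scores 0, and the distance between consecutive sub-matrices is passed in as `step` because its value is not shown here.

```c
#include <stddef.h>
#include <stdint.h>

/* Similarity (in percent) between one window of the big matrix and the
   small matrix. 0 is treated as black. Returns 0.0f when the thresholds
   (success > 0, sum_dots > 500) are not met; that fallback is an assumption. */
static float window_similarity(const uint8_t *sub, const uint8_t *small_img,
                               size_t pixel_num)
{
    int success = 0, sum_dots = 0;
    for (size_t i = 0; i < pixel_num; i++) {
        if (sub[i] == 0) {              /* black pixel in the window */
            sum_dots++;
            if (small_img[i] == 0)      /* also black in the small matrix */
                success++;
        }
    }
    if (success > 0 && sum_dots > 500)
        return 100.0f * (float)success / (float)sum_dots;
    return 0.0f;
}

/* Slide the window over the big matrix in steps of `step` elements and
   return the index of the window with the highest similarity, or -1 if
   no window scores above 0. The best score is written to *best_score
   when that pointer is not NULL. */
static int best_offset(const uint8_t *big, const uint8_t *small_img,
                       size_t pixel_num, size_t step, int num_offsets,
                       float *best_score)
{
    int best = -1;
    float best_val = 0.0f;
    for (int k = 0; k < num_offsets; k++) {
        float s = window_similarity(big + (size_t)k * step, small_img, pixel_num);
        if (s > best_val) {
            best_val = s;
            best = k;
        }
    }
    if (best_score)
        *best_score = best_val;
    return best;
}
```

The inner function does the same work as the loop above for a single sub-matrix; the outer function corresponds to what the kernel spreads across threads, plus the final search for the maximum.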
Do you see any improvements that could be made to the GPU code?