Hello,

I have an urgent problem with a simple morphology program.

I use the nppiErode_8u_C1R function to erode a grayscale image, and I do the same thing on the CPU.

On HD images, the GPU version takes ~25 ms while the CPU version takes 3 ms…

Why is there such a huge difference? Shouldn't the GPU be faster than the CPU?

I am adding some more information. This is my own kernel version of erode:

```
////////////////////////////////////////////////////////////////////////////////////
/// Local copy of the image tile (block size + kernel size) and of the kernel coordinates
///
__shared__ u_int8_t LocalImage[(BLOCK_SIZE_X+kernelSize_X)*(BLOCK_SIZE_Y+kernelSize_Y)];
__shared__ int16_t LocalKernelCoord[kernelSize_X*kernelSize_Y];

// Top-left corner of the tile handled by this block
int x = blockIdx.x*blockDim.x;
int y = blockIdx.y*blockDim.y;

// Cooperative load of the tile (block + kernel margin) into shared memory
for(int i=threadIdx.x; i<BLOCK_SIZE_X+kernelSize_X && (x+i)<w; i=i+blockDim.x)
{
    for(int j=threadIdx.y; j<BLOCK_SIZE_Y+kernelSize_Y && (y+j)<h; j=j+blockDim.y)
    {
        LocalImage[i+j*(BLOCK_SIZE_X+kernelSize_X)] = (u8_ImageIn)[x+i + (y+j)*w];
    }
}

// The first isizeCoord threads copy the kernel offsets into shared memory
if(threadIdx.x + threadIdx.y*BLOCK_SIZE_X < isizeCoord)
{
    int ptrk = threadIdx.x + threadIdx.y*BLOCK_SIZE_X;
    LocalKernelCoord[ptrk] = i16_Kernel[ptrk];
}
__syncthreads();
////////////////////////////////////////////////////////////
// Centre of the kernel and global coordinates of the pixel handled by this thread
int iMidX = (kernelSize_X-1)/2;
int iMidY = (kernelSize_Y-1)/2;
int xglobal = x+threadIdx.x+iMidX;
int yglobal = y+threadIdx.y+iMidY;
if(/*xglobal <= kernelSize_X/2 ||*/ xglobal >= w-kernelSize_X/2 || /*yglobal <= kernelSize_Y/2 ||*/ yglobal >= h-kernelSize_Y/2)
{
    // Change the output pixel to 0
    return;
}
// /*if(threadIdx.x>kernelSize_X || threadIdx.y>kernelSize_Y)
// return;*/

// Index of this thread's pixel inside the shared tile
int ptrl = threadIdx.x+iMidX + (threadIdx.y+iMidY) *(BLOCK_SIZE_X+kernelSize_X);
bool IsErode = false;

// A background pixel stays background
if(LocalImage[ptrl] == 0)
{
    u8_ImageOut[xglobal+yglobal*w] = 0;
    return;
}

// The pixel is eroded as soon as one neighbour under the kernel is 0.
// The offsets are read from i16_Kernel (global memory) here.
for(int k=0; k<isizeCoord; k++)
{
    if(LocalImage[ptrl + i16_Kernel[k]] == 0)
    {
        IsErode = true;
        break;
    }
    //float val;
    //atomicAdd(&val, LocalImage[ptrl+i16_Kernel[k]] );
    //__iAtomicAdd(&valeur,5);
}

if(IsErode == true)
{
    u8_ImageOut[xglobal+yglobal*w] = 0;//u8_ImageIn[xglobal+yglobal*w];
}
else
{
    u8_ImageOut[xglobal+yglobal*w] = 255;//u8_ImageIn[xglobal+yglobal*w];
}
```
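
For comparison, here is a minimal host-side sketch of the same binary erosion logic (output 255 only if the pixel and all its neighbours under the kernel are non-zero). It is not the OpenCV implementation. The names `Offset` and `erodeReference` are made up for illustration, and treating pixels outside the image as background is an assumption (the kernel above simply skips the border). Something like this could be used to check that the GPU output is correct before comparing timings.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper type: one structuring-element cell, relative to the centre.
struct Offset { int dx; int dy; };

// Hypothetical CPU reference: binary erosion of a w x h 8-bit image.
// A pixel becomes 255 only if it and every neighbour under the kernel are non-zero.
std::vector<std::uint8_t> erodeReference(const std::vector<std::uint8_t>& in,
                                         int w, int h,
                                         const std::vector<Offset>& offsets)
{
    std::vector<std::uint8_t> out(in.size(), 0);
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            bool keep = in[x + y * w] != 0;   // background stays background
            for (std::size_t k = 0; keep && k < offsets.size(); ++k) {
                const int nx = x + offsets[k].dx;
                const int ny = y + offsets[k].dy;
                // Assumption: neighbours outside the image count as background.
                if (nx < 0 || nx >= w || ny < 0 || ny >= h || in[nx + ny * w] == 0) {
                    keep = false;             // early exit, like the break in the kernel
                }
            }
            out[x + y * w] = keep ? 255 : 0;
        }
    }
    return out;
}
```

Comparing `erodeReference(...)` with the downloaded `u8_ImageOut` away from the image border would show whether both versions compute the same result.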

The kernel is an elliptical kernel, so it is not separable. This version takes ~35 ms, while OpenCV takes ~3 ms.
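
To make the kernel layout concrete, here is a rough sketch of how a list of linear offsets for an elliptical structuring element could be built on the host. The function name `ellipseOffsets` is hypothetical. It assumes odd kernel sizes and that each entry is stored as `dx + dy*stride` relative to the kernel centre, which is how the offsets are applied to `LocalImage` in the code above.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: linear offsets (dx + dy*stride) of every cell that lies
// inside the ellipse inscribed in a sizeX x sizeY box, relative to the box centre.
// Assumes odd sizeX and sizeY.
std::vector<std::int16_t> ellipseOffsets(int sizeX, int sizeY, int stride)
{
    std::vector<std::int16_t> offsets;
    const int midX = (sizeX - 1) / 2;
    const int midY = (sizeY - 1) / 2;
    for (int dy = -midY; dy <= midY; ++dy) {
        for (int dx = -midX; dx <= midX; ++dx) {
            // Inside test (dx/midX)^2 + (dy/midY)^2 <= 1, written without division.
            const long lhs = static_cast<long>(dx) * dx * midY * midY
                           + static_cast<long>(dy) * dy * midX * midX;
            const long rhs = static_cast<long>(midX) * midX * midY * midY;
            if (lhs <= rhs) {
                offsets.push_back(static_cast<std::int16_t>(dx + dy * stride));
            }
        }
    }
    return offsets;
}
```

For the shared-memory tile above, `stride` would be `BLOCK_SIZE_X + kernelSize_X`, since that is the row pitch of `LocalImage`, and the size of the returned vector would play the role of `isizeCoord`.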

I don’t understand why…