PNPOLY approach faster on a single-threaded CPU than in CUDA: can someone tell me why?

Hello to all,

My computer has a quad-core Intel Q6600 processor and a Quadro 4000 graphics card. The operating system is Windows 7 64-bit, with
Visual Studio 2010 with CUDA 4.0.

I’m trying to port the PNPOLY algorithm (see Point Inclusion in Polygon Test) to CUDA.

The implementation is not yet finished, but what I have so far is enough for this question.

I just created an array with the X, Y, Z coordinates of a closed polygon:

float coordinates[] =
{
    721042.84, 4504290.03, 916.86,
    721046.61, 4504290.14, 0.00,
    721046.54, 4504289.48, 916.88,
    /* ... remaining vertices omitted ... */
};

and I modified the original algorithm like this:

int pnpoly2(int nvert, float *vertex, float testx, float testy)
{
    int i, j, c = 0;
    for (i = 1, j = i - 1; i < nvert; j = i++) {
        if ( ((vertex[3*i+1] > testy) != (vertex[3*j+1] > testy)) &&
             (testx < (vertex[3*j] - vertex[3*i]) * (testy - vertex[3*i+1]) / (vertex[3*j+1] - vertex[3*i+1]) + vertex[3*i]) )
            c = !c;
    }
    return c;
}
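
The calls further down use numberOfVertex - 1; here is a minimal sketch of how numberOfVertex itself can be derived from the interleaved array above (assuming coordinates stays a static array in the same file):

/* One vertex = three consecutive floats (x, y, z) in the interleaved array.
   Sketch only: assumes 'coordinates' is the static array shown above. */
int numberOfVertex = (int)(sizeof(coordinates) / (3 * sizeof(float)));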

After that, I created the following CUDA kernel:

__global__ void pnpolyGPU(const float *vertex, float testx, float testy, int *results)
{
int id = blockIdx.x;
int indexOriginX = (blockIdx.x + 1) * 3;
int indexOriginY = (blockIdx.x + 1) * 3 + 1;
int indexDestinationX = blockIdx.x * 3;
int indexDestinationY = blockIdx.x * 3 + 1;

if ( ((vertex[indexOriginY]>testy) != (vertex[indexDestinationY]>testy)) && (testx < (vertex[indexDestinationX]-vertex[indexOriginX]) * (testy-vertex[indexOriginY]) / (vertex[indexDestinationY]-vertex[indexOriginY]) + vertex[indexOriginX]) )
	results[id] = 1;
else
	results[id] = 0;

}
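
Note that this kernel only stores a 0/1 crossing flag per edge; to get the final inside/outside answer, these flags still have to be combined with the same parity rule the CPU loop applies via c = !c. A minimal host-side sketch (assuming results has been copied back from resultsGPU with cudaMemcpy and holds numberOfVertex - 1 entries):

/* Sketch only: combine the per-edge crossing flags produced by pnpolyGPU
   with the same parity rule the CPU version applies via c = !c. */
int pointInPolygon(const int *results, int numEdges)
{
    int c = 0;
    for (int e = 0; e < numEdges; ++e)
        if (results[e])
            c = !c;   /* toggle on every crossing */
    return c;         /* 1 = inside, 0 = outside */
}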

After that, I calculate the bounding box (min/max) of the polygon and then call both implementations like this:

SYSTEMTIME st;
GetSystemTime(&st);
printf("The system time is: %02d:%02d:%02d\n", st.wHour, st.wMinute, st.wSecond);

for( float y=yMin; y<yMax; y+=2 ) {
	for( float x=xMin; x<xMax; x+=2 ) {
		pnpolyGPU<<<numberOfVertex - 1,1>>>(valuesGPU, x, y, resultsGPU);
	}
}

SYSTEMTIME st2;
GetSystemTime(&st2);

printf("The system time is: %02d:%02d:%02d\n", st2.wHour, st2.wMinute, st2.wSecond);

for( float y=yMin; y<yMax; y+=2 ) {
	for( float x=xMin; x<xMax; x+=2 ) {
		pnpoly2(numberOfVertex-1, coordinates, x, y);
	}
}

SYSTEMTIME st3;
GetSystemTime(&st3);

printf("The system time is: %02d:%02d:%02d\n", st3.wHour, st3.wMinute, st3.wSecond);

To my surprise, I saw that the CPU implementation is faster than the CUDA implementation.
Why is the CPU implementation faster than the CUDA one?

Thank you in advance.
PnpolyCUDA.zip (30.7 KB)

Solved!

Using both blocks and threads, the performance improves dramatically.
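
For reference, a minimal sketch of what "blocks and threads" can look like for this kernel: the edges are spread across a grid of blocks with many threads per block instead of one block per edge. The kernel name pnpolyGPU2, the numEdges parameter, and the launch configuration below are illustrative assumptions, not necessarily the exact code used:

/* Sketch only: same per-edge test as pnpolyGPU above, but the edges are
   distributed over blocks *and* threads instead of one block per edge. */
__global__ void pnpolyGPU2(const float *vertex, float testx, float testy,
                           int *results, int numEdges)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;   /* global edge index */
    if (id >= numEdges)
        return;

    int indexOriginX      = (id + 1) * 3;
    int indexOriginY      = (id + 1) * 3 + 1;
    int indexDestinationX = id * 3;
    int indexDestinationY = id * 3 + 1;

    if ( ((vertex[indexOriginY] > testy) != (vertex[indexDestinationY] > testy)) &&
         (testx < (vertex[indexDestinationX] - vertex[indexOriginX]) * (testy - vertex[indexOriginY]) /
                  (vertex[indexDestinationY] - vertex[indexOriginY]) + vertex[indexOriginX]) )
        results[id] = 1;
    else
        results[id] = 0;
}

Launched, for example, with 256 threads per block:

pnpolyGPU2<<<(numberOfVertex - 1 + 255) / 256, 256>>>(valuesGPU, x, y, resultsGPU, numberOfVertex - 1);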