Execution problem: works in emulation and with < 100000 points

Hello all,

I am working on data sets of 1000000 points and I am seeing a weird problem in execution. The code works just fine in emulation mode, and on the device when the number of points is 100000, but it won’t work for 1000000.

__global__ void cuda_delaunay_func(int blocks, block_properties *block_p)
{
	int bid = blockIdx.x;
	long int i;
	int m, flag;
	float xn, yn, zn;
	long int F=0;
	long int index = 0, end_index = 0;
	long int roller=0;

	index = block_p[bid].index;
	end_index = block_p[bid].end_index;

	for(i=index; i< end_index - 2; i++)
	{
		roller++;
	}

	block_p[bid].no_of_loops = roller;
	block_p[bid].no_of_tri = F;
}

The block_properties structure is:

struct block_properties
{
	unsigned int siteidx, counter, deltay, deltax;
	long int no_of_tri, no_of_loops;
	long int t1, t2, t3, t4;
	long int l1, l2, l3, l4;
	long int index;
	long int end_index;
};

I don’t know how to find the error. When I run the program it does not report any error; it just skips the function above and executes the other functions.

Also, I am using Ubuntu 8.04 with a Tesla C870. Can I use the CUDA debugger?

Thanks,

Sambi

The debugger doesn’t work on the C870 (and I’ve no idea whether it works with Ubuntu 8.04 or not).

Are you checking error codes?

It doesn’t give me any error while executing. When I copy block_properties back to the host and print it, all I get is zeros.

Are you checking what cudaThreadSynchronize returns?

Keep in mind SDK error-checking functions only work in Debug builds.
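For reference, the kind of check being asked about could look roughly like the sketch below. CHECK_CUDA, fill_block_ids and last_block_id are made-up names for illustration, not part of the SDK or of the code in this thread, and the kernel is only a stand-in. It uses cudaDeviceSynchronize, the current name for what the 2.0-era toolkit calls cudaThreadSynchronize. Because the macro is defined in the file itself, it stays active in Release builds.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper (not an SDK macro): print the error and stop.
// It is compiled in unconditionally, so it also fires in Release builds.
#define CHECK_CUDA(call)                                          \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Stand-in kernel (an assumption, not from the thread):
// one thread per block, each block stores its own index.
__global__ void fill_block_ids(int *out, int n)
{
    int bid = blockIdx.x;
    if (bid < n)
        out[bid] = bid;
}

// Launches the stand-in kernel and returns the value written by the last block.
int last_block_id(int blocks)
{
    int *d_out = NULL;
    std::vector<int> h_out(blocks);

    CHECK_CUDA(cudaMalloc((void **)&d_out, blocks * sizeof(int)));

    fill_block_ids<<<blocks, 1>>>(d_out, blocks);
    CHECK_CUDA(cudaGetLastError());       // bad launch configuration shows up here
    CHECK_CUDA(cudaDeviceSynchronize());  // faults during execution show up here

    CHECK_CUDA(cudaMemcpy(&h_out[0], d_out, blocks * sizeof(int),
                          cudaMemcpyDeviceToHost));
    CHECK_CUDA(cudaFree(d_out));
    return h_out[blocks - 1];
}

int main()
{
    printf("last block wrote %d\n", last_block_id(8));
    return 0;
}
```

Kernel launches are asynchronous, so a fault inside the kernel is normally reported by the synchronize call or by a later runtime call such as cudaMemcpy, not by the launch itself. That is why a program that never looks at return codes can appear to "skip" a kernel and read back zeros.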

Yes, I get "unspecified launch failure" for the following function, which runs before the function given above. The same error occurs for the function given above.

__global__ void move_points(point2 *pData, uint elements, limit *cuda_values, int blocks, delaunay_struct *cu_de, block_properties *block_p)
{
	unsigned int index=0, end_index;
	int bid = blockIdx.x;
	int i;

	for (i=0; i< bid; i++)
	{
		index = index + block_p[i].counter + 3;
	}
	index ++;
	end_index = index + block_p[bid].counter;

	block_p[bid].siteidx = index;
	block_p[bid].index = index;
	block_p[bid].end_index = end_index;
	block_p[bid].t1 = block_p[bid].t2 = block_p[bid].t3 = block_p[bid].t4 = 0;
	block_p[bid].l1 = block_p[bid].l2 = block_p[bid].l3 = block_p[bid].l4 = 0;

	for (i=0; i< elements; i++)
	{
		if(bid == 0)
		{
			if(pData[i].y>=cuda_values[bid].bottom && pData[i].y <= cuda_values[bid+1].top)
			{
				cu_de[index].x = pData[i].x;
				cu_de[index].y = pData[i].y;
				cu_de[index].z = cu_de[index].x*cu_de[index].x + cu_de[index].y * cu_de[index].y;
				index++;
			}
		}
		else if(bid == (blocks-1))
		{
			if(pData[i].y>=cuda_values[bid-1].bottom && pData[i].y <= cuda_values[bid].top)
			{
				cu_de[index].x = pData[i].x;
				cu_de[index].y = pData[i].y;
				cu_de[index].z = cu_de[index].x*cu_de[index].x + cu_de[index].y * cu_de[index].y;
				index++;
			}
		}
		else
		{
			if(pData[i].y>=cuda_values[bid-1].bottom && pData[i].y <= cuda_values[bid+1].top)
			{
				cu_de[index].x = pData[i].x;
				cu_de[index].y = pData[i].y;
				cu_de[index].z = cu_de[index].x*cu_de[index].x + cu_de[index].y * cu_de[index].y;
				index++;
			}
		}
	}
}

Why isn’t there anything about cudaThreadSynchronize in Programming Guide 2.0?

It is explained. Go read 4.5.15.

Also, unspecified launch failure = you have a segfault. Run your kernel through valgrind in emulation.
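If valgrind is not an option, one way to narrow down a suspected out-of-bounds write is to guard the stores and record when an index runs past the allocation. The sketch below is only an illustration under assumptions: guarded_store, fill, overflow_detected and the capacity parameter are hypothetical names, and nothing here claims that the cu_de writes in move_points are the actual culprit. The idea is to pass the number of allocated elements to the kernel, so that a bad index sets a flag instead of crashing the launch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical guarded store: writes out[index] only if index is inside the
// allocation, otherwise records the overflow in *overflow_flag.
__device__ bool guarded_store(float *out, unsigned int index,
                              unsigned int capacity, float value,
                              int *overflow_flag)
{
    if (index >= capacity) {
        *overflow_flag = 1;  // every offending writer stores the same value
        return false;
    }
    out[index] = value;
    return true;
}

// Stand-in kernel (an assumption, not from the thread): one thread per block,
// each block writes per_block values starting at blockIdx.x * per_block.
__global__ void fill(float *out, unsigned int capacity,
                     unsigned int per_block, int *overflow_flag)
{
    unsigned int index = blockIdx.x * per_block;
    for (unsigned int i = 0; i < per_block; i++)
        guarded_store(out, index + i, capacity, 1.0f, overflow_flag);
}

// Returns 1 if any block tried to write past `capacity` elements, 0 if none
// did, and -1 if a CUDA runtime call failed.
int overflow_detected(unsigned int blocks, unsigned int per_block,
                      unsigned int capacity)
{
    float *d_out = NULL;
    int *d_flag = NULL;
    int h_flag = 0;

    if (cudaMalloc((void **)&d_out, capacity * sizeof(float)) != cudaSuccess)
        return -1;
    if (cudaMalloc((void **)&d_flag, sizeof(int)) != cudaSuccess) {
        cudaFree(d_out);
        return -1;
    }
    cudaMemset(d_flag, 0, sizeof(int));

    fill<<<blocks, 1>>>(d_out, capacity, per_block, d_flag);

    // This blocking copy also waits for the kernel and reports its errors.
    cudaError_t err = cudaMemcpy(&h_flag, d_flag, sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    cudaFree(d_flag);
    return (err == cudaSuccess) ? h_flag : -1;
}

int main()
{
    // 4 blocks * 3 values = 12 writes into a 10-element buffer.
    printf("overflow detected: %d\n", overflow_detected(4, 3, 10));
    return 0;
}
```

With a guard like this in place, a data set that previously ended in "unspecified launch failure" should instead finish and set the flag if an index was the problem, which at least tells you which array to look at.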