PI evaluator

Hello.

I'm trying to assemble a simple pi evaluator. Although I've understood this is not a good starter project, I've decided to give it a try. The main idea is to evaluate as many sample points as possible in the unit square from (0,0) to (1,1), checking for each point whether its distance from the origin is less than or equal to 1. Summing (scanning) the resulting vector of hits, we should get a ratio hits/length of about 0.785 (pi/4) for many samples.

The main problem here is of course to get random values with precision high enough to yield new decimals. The values also have to be replaced with new ones after each loop. As CUDA doesn't have any rand facilities and I haven't taken advantage of the Mersenne Twister yet, I'm creating the random values on the host and transferring them for each kernel run. I understand that this solution will cripple the performance to pathetic levels.

However, I've bumped into a problem: in Debug mode, running the kernel below just ONCE already gives unexpected results.

__global__ void evaluatePI(float *g_odata, float *x, float *y, int n)
{

extern  __shared__  int temp[];

int thid = threadIdx.x;

if(sqrt(x[thid]*x[thid]+y[thid]*y[thid])<=1.0f)
	temp[thid] = 1;
else
	temp[thid] = 0;

__syncthreads();

g_odata[thid] += temp[thid];

}

Using += in the assignment gives me a g_odata vector whose elements are not 0 or 1 but 0 or 56, even if the kernel is executed just once. g_odata is of course cleared to 0 with cudaMemset before the launch. If plain = is used, the values are as expected. Running this example in EmuDebug gives me the expected values when using +=. What could be the reason this doesn't work in Debug?

This is the code that calls the kernel. Please note that this is of course not intended to be a serious pi evaluator, as the precision of the random values is limited to three decimals :P

unsigned int numIterations = 10;
for (unsigned int i = 0; i < numIterations; ++i)
{
	// fill one batch of 256 sample points with three-decimal random values
	for( unsigned int j = 0; j < 256; ++j) 
	{
		h_xdata[j] = (rand()%1000)/1000.0f;
		h_ydata[j] = (rand()%1000)/1000.0f;
	}
	CUDA_SAFE_CALL( cudaMemcpy( d_ixdata, h_xdata, mem_size, cudaMemcpyHostToDevice) );
	CUDA_SAFE_CALL( cudaMemcpy( d_iydata, h_ydata, mem_size, cudaMemcpyHostToDevice) );

	evaluatePI<<< grid, threads, 2 * shared_mem_size  >>>
		(d_odata, d_ixdata, d_iydata, num_elements);
}
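
For comparison, here is a minimal CPU-only sketch of the result I would expect the GPU version to reproduce. The names countHits and estimatePi are made up for this illustration and are not part of my project. It compares the squared distance against 1 instead of calling sqrt, which should give the same hit/miss decision apart from floating-point rounding right at the boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical helper: count the points (x[i], y[i]) that fall inside the
// quarter circle of radius 1. Uses the squared distance, so no sqrt is needed.
int countHits(const std::vector<float>& x, const std::vector<float>& y)
{
    assert(x.size() == y.size());
    int hits = 0;
    for (std::size_t i = 0; i < x.size(); ++i)
        if (x[i] * x[i] + y[i] * y[i] <= 1.0f)
            ++hits;
    return hits;
}

// Hypothetical helper: run `iterations` batches of `batch` samples each,
// generated the same way as in the host loop above, and return 4 * hits / total.
float estimatePi(int iterations, int batch)
{
    std::vector<float> x(batch), y(batch);
    long hits = 0;
    for (int it = 0; it < iterations; ++it) {
        for (int j = 0; j < batch; ++j) {
            x[j] = (rand() % 1000) / 1000.0f;
            y[j] = (rand() % 1000) / 1000.0f;
        }
        hits += countHits(x, y);
    }
    const float total = static_cast<float>(iterations) * static_cast<float>(batch);
    return 4.0f * static_cast<float>(hits) / total;
}
```

Calling estimatePi(10, 256) mirrors the 10 iterations of 256 samples used above, so the hit count from the device should land in the same range as this value.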

Apart from the problem above, could you please elaborate on what could have been done to speed this up? How much does it actually cost to copy back and forth between the host and the device (cudaMemcpyDeviceToHost, cudaMemcpyHostToDevice)?

Thank you so much for any input here.

Ola