New to CUDA, simple kernel gives output of zero

I’m a programmer new to CUDA, looking to use GPGPU for an embarrassingly parallel problem I’m working on. I recently found out that my crappy integrated GeForce 8200 chipset actually supports CUDA, so I installed drivers for it, hooked up a second display, installed the SDK and examples, and got the example apps to run on the hardware.

Next, I tried following a tutorial to get a simple floating point benchmark running. It’s a program I write and use all the time, in various languages, using different compilers and optimization techniques, to see what kind of performance each system offers. The program is a (slow and inaccurate) Monte Carlo method of calculating pi, which is heavily dependent on floating point performance. I love this problem for benchmarking because it’s easy to write, and embarrassingly parallel.
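For reference, the method itself can be sketched on the CPU in a few lines. This is only a minimal illustration of the technique described above, not the CUDA program in question: points are sampled uniformly in the unit square, and the fraction that lands inside the quarter circle of radius 1 approximates pi/4. The function name `estimatePi`, its `seed` parameter, and the use of `std::mt19937` are assumptions made for the sketch.

```cpp
#include <random>

// Estimate pi by sampling `samples` points uniformly in the unit square and
// counting how many fall inside the quarter circle of radius 1.
// `estimatePi` and `seed` are illustrative names, not part of the original program.
double estimatePi(unsigned int samples, unsigned int seed)
{
    std::mt19937 gen(seed);
    std::uniform_real_distribution<float> unit(0.0f, 1.0f);

    unsigned int hits = 0;
    for (unsigned int i = 0; i < samples; i++)
    {
        float x = unit(gen);
        float y = unit(gen);
        // Comparing the squared distance against 1 avoids the square root.
        if (x * x + y * y <= 1.0f)
            hits++;
    }

    // (area of quarter circle) / (area of unit square) = pi / 4,
    // so the estimate is 4 * hits / samples.
    return 4.0 * hits / samples;
}
```

Calling `estimatePi(1000000, 42)` should give a value close to 3.14; the error shrinks only with the square root of the sample count, which is why the method is slow and inaccurate.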

So I wrote a version of this program using CUDA, and the API has been surprisingly easy to learn and use, I must say. However, I have run into a confusing problem. Whenever I run my application, the output from the GPU is always an array full of zeros. I’m not sure what I’m doing wrong here, and I would greatly appreciate the advice of people more experienced in this.

#include <iostream>
#include <stdio.h>
#include <cstdlib>
#include <cuda.h>
#include <ctime>
#include <cmath>

using namespace std;

__global__ void withinCircle(float* x, float* y, float* out, unsigned int num)
{
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	out[idx] = sqrt((x[idx] * x[idx]) + (y[idx] * y[idx]));
}

int main()
{
	unsigned int iterations = 100000000;

	// Host pointers
	float *randomX, *randomY, *out;

	// Device pointers
	float *gRandomX, *gRandomY, *gOut;

	randomX = new float[iterations];
	randomY = new float[iterations];
	out = new float[iterations];

	cudaMalloc((void**) &gRandomX, iterations);
	cudaMalloc((void**) &gRandomY, iterations);
	cudaMalloc((void**) &gOut, iterations);

	for(unsigned int i = 0; i < iterations; i++)
	{
		randomX[i] = (rand() / (float)RAND_MAX);
		randomY[i] = (rand() / (float)RAND_MAX);
	}

	cout << "Finished generating random input." << endl;

	cudaMemcpy(gRandomX, randomX, sizeof(float)*iterations, cudaMemcpyHostToDevice);
	cudaMemcpy(gRandomY, randomY, sizeof(float)*iterations, cudaMemcpyHostToDevice);
	cudaMemcpy(gOut, out, sizeof(float)*iterations, cudaMemcpyHostToDevice);

	unsigned int blockSize = 4;
	unsigned int numBlocks = iterations/blockSize + (iterations % blockSize == 0 ? 0 : 1);

	withinCircle<<< numBlocks, blockSize >>>(gRandomX, gRandomY, gOut, iterations);

	cudaMemcpy(out, gOut, sizeof(float)*iterations, cudaMemcpyDeviceToHost);

	unsigned int hits = 0;

	for(int i = 0; i < iterations; i++)
	{
		cout << out[i] << endl;
		hits += (out[i] <= 1.0f) ? 1 : 0;
	}

	delete [] randomX;
	delete [] randomY;
	delete [] out;

	cout << (iterations/(float)hits)*4 << endl;

	return 0;
}
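
One thing that might be worth checking is the sizing arithmetic in the listing above. The sketch below is plain host-side C++ with no CUDA calls; it only redoes the byte-count and block-count calculations. The helper names are hypothetical, and the 65535 figure is an assumption about the per-dimension grid limit on compute capability 1.x hardware such as the GeForce 8200, so it should be confirmed against the device's actual properties.

```cpp
#include <cstddef>

// Assumed per-dimension grid limit on compute capability 1.x devices;
// verify against the real device properties before relying on it.
const unsigned int kMaxGridDimX = 65535;

// Bytes needed to hold `count` floats.
std::size_t floatBytes(unsigned int count)
{
    return sizeof(float) * static_cast<std::size_t>(count);
}

// Blocks needed to cover `count` elements (ceiling division),
// using the same expression as the listing.
unsigned int blocksFor(unsigned int count, unsigned int blockSize)
{
    return count / blockSize + (count % blockSize == 0 ? 0 : 1);
}

// True when a one-dimensional launch fits within the assumed grid limit.
bool launchFits(unsigned int count, unsigned int blockSize)
{
    return blocksFor(count, blockSize) <= kMaxGridDimX;
}
```

With the values from the listing, `floatBytes(100000000)` is 400000000, while each `cudaMalloc` call requests only 100000000 bytes, and `blocksFor(100000000, 4)` is 25000000, far above the assumed limit. Either mismatch could plausibly make the CUDA calls fail without any visible error, since their return codes are never checked.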



EDIT: Oops, just realized this would have been better off in a different forum. I’m closing this topic and moving it to the general CUDA discussion forum. Sorry for the mix-up.