I’m a programmer new to CUDA, looking to use GPGPU for an embarrassingly parallel problem I’m working on. I recently found out that my crappy integrated GeForce 8200 chipset actually supports CUDA, so I installed drivers for it, hooked up a second display, installed the SDK and examples, and actually got the example apps to run on the hardware.

Next, I tried following a tutorial to get a simple floating-point benchmark running. It’s a program I rewrite all the time in various languages, with different compilers and optimization techniques, to see what kind of performance each system offers. The program is a (slow and inaccurate) Monte Carlo method of calculating pi, which depends heavily on floating-point performance. I love this problem for benchmarking because it’s easy to write and embarrassingly parallel.
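For reference, here is the estimate the benchmark relies on, written out under the assumption that x and y are drawn uniformly from [0, 1] (the symbols N and "hits" are just labels for the sample count and the number of points inside the circle):

$$
P\left(\sqrt{x^2 + y^2} \le 1\right) = \frac{\text{area of quarter unit circle}}{\text{area of unit square}} = \frac{\pi}{4}
\quad\Longrightarrow\quad
\pi \approx 4 \cdot \frac{\text{hits}}{N}
$$

So the fraction of points that land inside the circle, multiplied by 4, should approach pi as N grows.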

So I wrote a version of this program using CUDA, and the API has been surprisingly easy to learn and use, I must say. However, I have run into a confusing problem. Whenever I run my application, the output from the GPU is always an array full of zeros. I’m not sure what I’m doing wrong here, and I would greatly appreciate the advice of people more experienced in this.

```
#include <iostream>
#include <stdio.h>
#include <cuda.h>
#include <cstdlib>   // srand, rand, RAND_MAX
#include <ctime>
#include <cmath>
using namespace std;

__global__ void withinCircle(float* x, float* y, float* out, unsigned int num)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = sqrt((x[idx] * x[idx]) + (y[idx] * y[idx]));
}

int main()
{
    unsigned int iterations = 100000000;
    srand(time(NULL));

    // Host pointers
    float *randomX, *randomY, *out;
    // Device pointers
    float *gRandomX, *gRandomY, *gOut;

    randomX = new float[iterations];
    randomY = new float[iterations];
    out = new float[iterations];
    cudaMalloc((void**) &gRandomX, iterations);
    cudaMalloc((void**) &gRandomY, iterations);
    cudaMalloc((void**) &gOut, iterations);

    for(unsigned int i = 0; i < iterations; i++)
    {
        randomX[i] = (rand() / (float)RAND_MAX);
        randomY[i] = (rand() / (float)RAND_MAX);
    }
    cout << "Finished generating random input." << endl;

    cudaMemcpy(gRandomX, randomX, sizeof(float)*iterations, cudaMemcpyHostToDevice);
    cudaMemcpy(gRandomY, randomY, sizeof(float)*iterations, cudaMemcpyHostToDevice);
    cudaMemcpy(gOut, out, sizeof(float)*iterations, cudaMemcpyHostToDevice);

    unsigned int blockSize = 4;
    unsigned int numBlocks = iterations/blockSize + (iterations % blockSize == 0 ? 0 : 1);
    withinCircle<<< numBlocks, blockSize >>>(gRandomX, gRandomY, gOut, iterations);
    cudaThreadSynchronize();
    cudaMemcpy(out, gOut, sizeof(float)*iterations, cudaMemcpyDeviceToHost);

    unsigned int hits = 0;
    for(unsigned int i = 0; i < iterations; i++)
    {
        cout << out[i] << endl;
        hits += (out[i] <= 1.0f) ? 1 : 0;
    }

    delete [] randomX;
    delete [] randomY;
    delete [] out;
    cudaFree(gRandomX);
    cudaFree(gRandomY);
    cudaFree(gOut);

    // Fraction of points inside the quarter circle, times 4
    cout << (hits/(float)iterations)*4 << endl;
    return 0;
}
```

Cheers.
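
A few things in the listing above look like plausible causes, though none of this has been confirmed on the GeForce 8200 itself. `cudaMalloc` takes a size in bytes, so passing `iterations` reserves only a quarter of what the later `cudaMemcpy` calls transfer. None of the CUDA calls are checked, so a failed allocation, copy, or launch would go unnoticed and `out` would simply keep whatever the host array held before. A block size of 4 with 100,000,000 elements also asks for 25,000,000 blocks, while early CUDA hardware is limited to 65,535 blocks per grid dimension. The sketch below is one way to make such failures visible. `CUDA_CHECK` and `blocksFor` are hypothetical helpers, not part of CUDA, and the sample count and block size are assumptions chosen to fit a small GPU.

```cuda
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA): stop with a message as soon as a
// runtime call reports an error instead of silently continuing.
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err_));                     \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Hypothetical helper: number of blocks needed to cover n elements (rounds up).
unsigned int blocksFor(unsigned int n, unsigned int blockSize)
{
    return (n + blockSize - 1) / blockSize;
}

__global__ void withinCircle(const float* x, const float* y, float* out, unsigned int num)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < num)  // threads in the last block may fall past the end
        out[idx] = sqrtf(x[idx] * x[idx] + y[idx] * y[idx]);
}

int main()
{
    // Assumption: a much smaller sample than the original, so three buffers
    // fit in limited GPU memory and the grid stays below 65,535 blocks.
    const unsigned int n = 1000000;
    const size_t bytes = n * sizeof(float);  // cudaMalloc expects bytes

    srand(static_cast<unsigned int>(time(NULL)));
    std::vector<float> x(n), y(n), out(n);
    for (unsigned int i = 0; i < n; i++) {
        x[i] = rand() / (float)RAND_MAX;
        y[i] = rand() / (float)RAND_MAX;
    }

    float *gX, *gY, *gOut;
    CUDA_CHECK(cudaMalloc((void**)&gX, bytes));
    CUDA_CHECK(cudaMalloc((void**)&gY, bytes));
    CUDA_CHECK(cudaMalloc((void**)&gOut, bytes));
    CUDA_CHECK(cudaMemcpy(gX, &x[0], bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(gY, &y[0], bytes, cudaMemcpyHostToDevice));

    const unsigned int blockSize = 256;
    withinCircle<<<blocksFor(n, blockSize), blockSize>>>(gX, gY, gOut, n);
    CUDA_CHECK(cudaGetLastError());       // errors from the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran

    CUDA_CHECK(cudaMemcpy(&out[0], gOut, bytes, cudaMemcpyDeviceToHost));

    unsigned int hits = 0;
    for (unsigned int i = 0; i < n; i++)
        hits += (out[i] <= 1.0f) ? 1 : 0;

    CUDA_CHECK(cudaFree(gX));
    CUDA_CHECK(cudaFree(gY));
    CUDA_CHECK(cudaFree(gOut));

    printf("pi is roughly %f\n", 4.0f * hits / n);
    return 0;
}
```

If any call fails, the macro prints the file, line, and the runtime's error string, which should narrow down which step is leaving the output untouched. `cudaDeviceSynchronize` requires CUDA 4.0 or newer; on an older toolkit, the `cudaThreadSynchronize` call from the original listing plays the same role.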