New to CUDA, simple kernel gives output of zero.

I’m a programmer new to CUDA, looking to use GPGPU for a certain embarrassingly parallel problem I’m working on. I recently found out that my crappy integrated GeForce 8200 chipset actually supports CUDA, so I installed drivers for it, hooked up a second display, installed the SDK and examples, and actually got the example apps to run on the hardware.

Next, I tried following a tutorial to get a simple floating point benchmark running. It’s a program I write and use all the time, in various languages, using different compilers and optimization techniques, to see what kind of performance each system offers. The program is a (slow and inaccurate) Monte Carlo method of calculating pi, which is heavily dependent on floating point performance. I love this problem for benchmarking because it’s easy to write, and embarrassingly parallel.

So I wrote a version of this program using CUDA, and the API has been surprisingly easy to learn and use, I must say. However, I have run into a confusing problem. Whenever I run my application, the output from the GPU is always an array full of zeros. I’m not sure what I’m doing wrong here, and I would greatly appreciate the advice of people more experienced in this.

#include <iostream>
#include <stdio.h>
#include <cuda.h>
#include <ctime>
#include <cmath>

using namespace std;

__global__ void withinCircle(float* x, float* y, float* out, unsigned int num)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = sqrt((x[idx] * x[idx]) + (y[idx] * y[idx]));
}

int main()
{
    unsigned int iterations = 100000000;

    // Host pointers
    float *randomX, *randomY, *out;

    // Device pointers
    float *gRandomX, *gRandomY, *gOut;

    randomX = new float[iterations];
    randomY = new float[iterations];
    out = new float[iterations];

    cudaMalloc((void**) &gRandomX, iterations);
    cudaMalloc((void**) &gRandomY, iterations);
    cudaMalloc((void**) &gOut, iterations);

    for(unsigned int i = 0; i < iterations; i++)
    {
        randomX[i] = (rand() / (float)RAND_MAX);
        randomY[i] = (rand() / (float)RAND_MAX);
    }

    cout << "Finished generating random input." << endl;

    cudaMemcpy(gRandomX, randomX, sizeof(float)*iterations, cudaMemcpyHostToDevice);
    cudaMemcpy(gRandomY, randomY, sizeof(float)*iterations, cudaMemcpyHostToDevice);
    cudaMemcpy(gOut, out, sizeof(float)*iterations, cudaMemcpyHostToDevice);

    unsigned int blockSize = 4;
    unsigned int numBlocks = iterations/blockSize + (iterations % blockSize == 0 ? 0 : 1);

    withinCircle<<< numBlocks, blockSize >>>(gRandomX, gRandomY, gOut, iterations);

    cudaMemcpy(out, gOut, sizeof(float)*iterations, cudaMemcpyDeviceToHost);

    unsigned int hits = 0;
    for(int i = 0; i < iterations; i++)
    {
        cout << out[i] << endl;
        hits += (out[i] <= 1.0f) ? 1 : 0;
    }

    delete [] randomX;
    delete [] randomY;
    delete [] out;

    cout << (iterations/(float)hits)*4 << endl;

    return 0;
}


Zero output usually means the kernel is never running. If I were to guess what is wrong, it would be that the number of iterations is too large for the available GPU memory (your code requires 1.2 GB of GPU memory, if I am reading it correctly). If you add some error checking to the cudaMalloc and cudaMemcpy calls, and add a cudaGetLastError() call after the kernel launch, you should see where the code is failing. I would further guess you could confirm this hypothesis by reducing iterations to a small number, like 100, and seeing whether it works.
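The checks described above could be wired in with a small helper, something like this (a sketch; the CUDA_CHECK macro is my own convention, not part of the CUDA API):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda.h>

// Hypothetical helper: aborts with file/line info if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",                  \
                    __FILE__, __LINE__, cudaGetErrorString(err));         \
            exit(1);                                                      \
        }                                                                 \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc((void**) &gOut, sizeof(float)*iterations));
//   withinCircle<<< numBlocks, blockSize >>>(gRandomX, gRandomY, gOut, iterations);
//   CUDA_CHECK(cudaGetLastError());         // catches launch-configuration failures
//   CUDA_CHECK(cudaDeviceSynchronize());    // catches errors during kernel execution
```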

Adding to avidday’s reply, you will also want to change the allocations to

    cudaMalloc((void**) &gRandomX, sizeof(float)*iterations);
    cudaMalloc((void**) &gRandomY, sizeof(float)*iterations);
    cudaMalloc((void**) &gOut, sizeof(float)*iterations);

so you don’t scribble outside the allocated memory (which is likely to result in the failure you described).

Aha! This fixed it, thanks so much!

I was unaware that cudaMalloc expects its size in bytes rather than in number of elements. Looking back, this makes sense now.

Thanks again.

Avidday, thank you for your help as well.