New to CUDA, simple kernel gives output of zero.

I’m a programmer new to CUDA, looking to use GPGPU for a certain embarrassingly parallel problem I’m working on. I recently found out that my crappy integrated GeForce 8200 chipset actually supports CUDA, so I installed drivers for it, hooked up a second display, installed the SDK and examples, and actually got the example apps to run on the hardware.

Next, I tried following a tutorial to get a simple floating point benchmark running. It’s a program I write and use all the time, in various languages, with different compilers and optimization techniques, to see what kind of performance each system offers. The program is a (slow and inaccurate) Monte Carlo method of calculating pi, which is heavily dependent on floating point performance. I love this problem for benchmarking because it’s easy to write and embarrassingly parallel.
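(For anyone not familiar with the method: you throw random points into the unit square, count how many land inside the quarter circle of radius 1, and the hit ratio times 4 approximates pi. A stripped-down serial sketch of the idea, not my actual benchmark code:)

// Serial sketch of the Monte Carlo pi estimate, for illustration only.
#include <cstdlib>
#include <iostream>

int main()
{
	const unsigned int samples = 1000000;
	unsigned int hits = 0;
	for(unsigned int i = 0; i < samples; i++)
	{
		float x = rand() / (float)RAND_MAX;
		float y = rand() / (float)RAND_MAX;
		if(x*x + y*y <= 1.0f)	// point falls inside the quarter circle
			hits++;
	}
	std::cout << 4.0f * hits / samples << std::endl;	// approximates pi
	return 0;
}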

So I wrote a version of this program using CUDA, and the API has been surprisingly easy to learn and use, I must say. However, I have run into a confusing problem. Whenever I run my application, the output from the GPU is always an array full of zeros. I’m not sure what I’m doing wrong here, and I would greatly appreciate the advice of people more experienced in this.

#include <iostream>
#include <stdio.h>
#include <cuda.h>
#include <ctime>
#include <cmath>

using namespace std;

__global__ void withinCircle(float* x, float* y, float* out, unsigned int num)
{
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	out[idx] = sqrt((x[idx] * x[idx]) + (y[idx] * y[idx]));
}

int main()
{
	unsigned int iterations = 100000000;
	srand(time(NULL));

	// Host pointers
	float *randomX, *randomY, *out;

	// Device pointers
	float *gRandomX, *gRandomY, *gOut;

	randomX = new float[iterations];
	randomY = new float[iterations];
	out = new float[iterations];

	cudaMalloc((void**) &gRandomX, iterations);
	cudaMalloc((void**) &gRandomY, iterations);
	cudaMalloc((void**) &gOut, iterations);

	for(unsigned int i = 0; i < iterations; i++)
	{
		randomX[i] = (rand() / (float)RAND_MAX);
		randomY[i] = (rand() / (float)RAND_MAX);
	}

	cout << "Finished generating random input." << endl;

	cudaMemcpy(gRandomX, randomX, sizeof(float)*iterations, cudaMemcpyHostToDevice);
	cudaMemcpy(gRandomY, randomY, sizeof(float)*iterations, cudaMemcpyHostToDevice);
	cudaMemcpy(gOut, out, sizeof(float)*iterations, cudaMemcpyHostToDevice);

	unsigned int blockSize = 4;
	unsigned int numBlocks = iterations/blockSize + (iterations % blockSize == 0 ? 0 : 1);

	withinCircle<<< numBlocks, blockSize >>>(gRandomX, gRandomY, gOut, iterations);
	cudaThreadSynchronize();

	cudaMemcpy(out, gOut, sizeof(float)*iterations, cudaMemcpyDeviceToHost);

	unsigned int hits = 0;
	for(int i = 0; i < iterations; i++)
	{
		cout << out[i] << endl;
		hits += (out[i] <= 1.0f) ? 1 : 0;
	}

	delete [] randomX;
	delete [] randomY;
	delete [] out;

	cudaFree(gRandomX);
	cudaFree(gRandomY);
	cudaFree(gOut);

	cout << (iterations/(float)hits)*4 << endl;

	return 0;
}

Cheers.

Zero output usually means the kernel is never running. If I were to guess what is wrong, it would be that the number of iterations is too large for the available GPU memory (your code requires 1.2 GB of dynamic GPU memory if I am reading it correctly). If you add some error checking to the cudaMalloc and cudaMemcpy calls, and add a cudaGetLastError() call after the kernel launch, you should see where the code is failing. You could confirm this hypothesis by reducing iterations to a small number, like 100, and seeing whether it works.
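For the error checking, something along these lines (just a rough sketch using the runtime error API, slotted into your existing code) will show you exactly which call is failing:

cudaError_t err;

err = cudaMalloc((void**) &gRandomX, iterations);	// check every allocation like this
if(err != cudaSuccess)
	printf("cudaMalloc gRandomX failed: %s\n", cudaGetErrorString(err));

// ... same check after the other cudaMalloc and cudaMemcpy calls ...

withinCircle<<< numBlocks, blockSize >>>(gRandomX, gRandomY, gOut, iterations);

err = cudaGetLastError();	// reports kernel launch failures
if(err != cudaSuccess)
	printf("kernel launch failed: %s\n", cudaGetErrorString(err));

err = cudaThreadSynchronize();	// reports errors that occur while the kernel runs
if(err != cudaSuccess)
	printf("kernel execution failed: %s\n", cudaGetErrorString(err));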

Adding to avidday’s reply, you will also want to change the allocations to

cudaMalloc((void**) &gRandomX, sizeof(float)*iterations);
cudaMalloc((void**) &gRandomY, sizeof(float)*iterations);
cudaMalloc((void**) &gOut, sizeof(float)*iterations);

so you don’t scribble outside the allocated memory (which is likely to result in the failure you described).

Aha! This fixed it, thanks so much!

I was unaware that CUDA’s malloc wanted the size in bytes, rather than the number of elements. Looking back, this makes sense now.
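For anyone else who makes the same mistake, the difference boils down to this (dBuf and N are just placeholder names here):

// cudaMalloc takes a size in bytes, just like malloc():
float* dBuf;
cudaMalloc((void**) &dBuf, N);			// wrong: allocates only N bytes
cudaMalloc((void**) &dBuf, N * sizeof(float));	// right: room for N floats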

Thanks again.

Avidday, thank you for your help as well.