Simple "Hello CUDA" example?

Could someone point me to a simple example CUDA program, preferably in C?

I’ve installed the drivers & SDK, and can use the script to compile & run most of the example programs. However, when I look at the source and try to figure out how I might actually write a program of my own, I find that most of the details are hidden away in obfuscated .h files, which are themselves stuck off somewhere out of the standard path. I could of course spend the time to slowly hack my way through all this, but I would rather be doing something useful.


if windows:

else if (linux_suse_11) ?

But it seems I wasn’t clear about what I’m looking for. I don’t want a “wizard” that adds more behind-the-curtain stuff to organize what NVIDIA stuffed behind the curtain; I want a C (preferably, though FORTRAN would do at a pinch) program that doesn’t have the curtains in it.

Or to put it another way, I’m the wizard (hey, it’s right there in my job description :-)), and I’d like to learn the details of this particular spell…

All the details you need are in the programming guide, including a step by step run through of a simple matrix multiplication example.

The only “hidden details” in the SDK samples are in cutil.h, which just has some useful error-checking macros. There simply isn’t a lot to the Runtime API. You don’t need to learn what this API is managing for you, especially if you’d like to be doing something useful. Just copy what the samples do, and worry about device code not boilerplate host code.

Or if you really love boilerplate code, look at some of the samples that use the Driver API. The Driver API, however, doesn’t add anything useful. Resist the urge to use it.

Here is a simple test program I wrote to get my feet wet. It fills an array with numbers and then takes the cosine of each element. Compile with “nvcc -use_fast_math” (on Suse 10.3…)

N.B.: I claim to be an expert in neither C nor CUDA…

/* CUDA program that uses the GPU to compute the cosine of an array of numbers */

/* --------------------------- header section ----------------------------*/

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define COS_THREAD_CNT 200
#define N 10000

/* --------------------------- target code ------------------------------*/

struct cosParams {
    float *arg;   /* device pointer: input values */
    float *res;   /* device pointer: results */
    int n;        /* number of elements */
};

__global__ void cos_main(struct cosParams parms)
{
    int i;

    /* thread t handles elements t, t + COS_THREAD_CNT, t + 2*COS_THREAD_CNT, ... */
    for (i = threadIdx.x; i < parms.n; i += COS_THREAD_CNT) {
        parms.res[i] = __cosf(parms.arg[i]);
    }
}

/* --------------------------- host code ------------------------------*/

int main (int argc, char *argv[])
{
    int i = 0;
    cudaError_t cudaStat;
    float* cosRes = 0;
    float* cosArg = 0;
    float* arg = (float *) malloc(N*sizeof(arg[0]));
    float* res = (float *) malloc(N*sizeof(res[0]));
    struct cosParams funcParams;

    /* ... fill arguments array "arg" .... */
    for(i=0; i < N; i++ ){
        arg[i] = (float)i;
    }

    cudaStat = cudaMalloc ((void **)&cosArg, N * sizeof(cosArg[0]));
    if( cudaStat )
        printf(" value = %d : Memory Allocation on GPU Device failed\n", (int)cudaStat);

    cudaStat = cudaMalloc ((void **)&cosRes, N * sizeof(cosRes[0]));
    if( cudaStat )
        printf(" value = %d : Memory Allocation on GPU Device failed\n", (int)cudaStat);

    cudaStat = cudaMemcpy (cosArg, arg, N * sizeof(arg[0]), cudaMemcpyHostToDevice);
    if( cudaStat )
        printf(" value = %d : Memory Copy from Host to Device failed.\n", (int)cudaStat);

    funcParams.res = cosRes;
    funcParams.arg = cosArg;
    funcParams.n = N;

    /* run the kernel as a single block of COS_THREAD_CNT threads */
    cos_main<<<1,COS_THREAD_CNT>>>(funcParams);
    cudaStat = cudaGetLastError();
    if( cudaStat )
        printf(" value = %d : Kernel launch failed.\n", (int)cudaStat);

    cudaStat = cudaMemcpy (res, cosRes, N * sizeof(cosRes[0]), cudaMemcpyDeviceToHost);
    if( cudaStat )
        printf(" value = %d : Memory Copy from Device to Host failed.\n", (int)cudaStat);

    /* print every tenth result */
    for(i=0; i < N; i++ ){
        if ( i%10 == 0 )
            printf("\n cosf(%f) = %f ", arg[i], res[i] );
    }
    printf("\n");

    cudaFree(cosArg);
    cudaFree(cosRes);
    free(arg);
    free(res);
    return 0;
}

/* nvcc -use_fast_math */
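
For anyone puzzling over how the loop in cos_main spreads N elements across COS_THREAD_CNT threads, here is a rough host-only C sketch of the same index arithmetic (no GPU involved). The helpers count_for_thread and owner_of are hypothetical names made up for illustration, and tid stands in for threadIdx.x.

```c
#define COS_THREAD_CNT 200
#define N 10000

/* Hypothetical helper: how many elements thread `tid` would visit
   when running the same strided loop as the kernel. */
int count_for_thread(int tid, int n, int nthreads)
{
    int count = 0;
    int i;
    for (i = tid; i < n; i += nthreads)
        count++;
    return count;
}

/* Hypothetical helper: which thread visits element i.
   The strided loop reaches i exactly when i % nthreads == tid. */
int owner_of(int i, int nthreads)
{
    return i % nthreads;
}
```

With the values from the program, count_for_thread(0, N, COS_THREAD_CNT) is 50, so each of the 200 threads handles 50 of the 10000 elements, and neighbouring threads touch neighbouring elements on each pass.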

Thanks, that worked just fine (after I put in a couple of degree/radian conversions), and was much, much easier to understand. Especially the compile command: one line vs ~350 lines of the SDK Makefile :-)
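
The degree/radian conversion mentioned here is not shown in the thread, so the following is only a guess at what it might look like on the host side. The names PI_F, deg_to_rad and fill_degrees_as_radians are hypothetical. The idea is simply that the cosine functions expect radians, so input given in degrees is scaled by pi/180 before it is copied to the device.

```c
#include <stddef.h>

#define PI_F 3.14159265358979f

/* Hypothetical helper: convert an angle in degrees to radians. */
float deg_to_rad(float deg)
{
    return deg * (PI_F / 180.0f);
}

/* Hypothetical helper: fill arg[i] with the angle i * step_deg degrees,
   stored in radians, ready to be copied to the GPU. */
void fill_degrees_as_radians(float *arg, size_t n, float step_deg)
{
    size_t i;
    for (i = 0; i < n; i++)
        arg[i] = deg_to_rad((float)i * step_deg);
}
```

In the program above, a call such as fill_degrees_as_radians(arg, N, 1.0f) could stand in for the loop that sets arg[i] = (float)i, assuming the inputs are meant to be whole degrees.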

Oh, I guess the fact that you’re on Linux is a factor there. On Windows the samples have VS solutions where the custom build step also boils down to one line. NVIDIA should really look into simplifying this facet of the examples. Actually, even on Windows the line is unnecessarily long and hard to find (buried in the properties of the .cu file).

Plus I hate how there are some very important nvcc options, like --ptxas-options=-v, that are pretty tricky to discover.

Would be great if NVIDIA had its own IDE, eg one based on Eclipse or on VS2k8 Shell. Could make for a much better and smoother experience.

Yeah, though I find it hard to understand why anyone would even contemplate doing serious number-crunching on Windoze. Games, sure, but (looking for handy foxhole) there’s more to life than games. Yet I can’t even install the NVidia graphics SDK, 'cause the download’s a .exe file.

Thanks for the hint :-)

I have to disagree there. I have my own preferred editor & so on, which I know how to use well and have customized (over many years) to fit my needs. Why take a big jump back down the learning curve, and take several times longer to accomplish a given task because I’d only know some bare minimum command set of whatever editor the IDE developers liked? Then drop it all and shift to something different when I need to work on a non-NVidia project?

Because unlike your beloved vi, you don’t need to spend a year learning a new GUI IDE ;) Obviously, having a GUI tailored for NVCC and its options, a gui that reports back important information, and, in general, a gui custom designed for CUDA would cut down learning curves for everyone, including you.

Lol, and why wouldn’t they? Numbers multiply the same way over here…

vi? Oh, get serious :-) As to spending a year, or however long it takes, to learn some new GUI IDE, I suppose that depends on what the IDE can do. If your IDE editor for instance just allows you to type things in (and I have seen ones in the Windoze world that not only just do that, but do it incorrectly), then yes, it may not take long to learn. But it will take several times longer to do any given task.

Perhaps having one that reports back useful information would be of some benefit, but I think it would be a whole lot simpler not to hide that information in the first place.

Only slower :-)

Where do you get that? CUDA has been observed to run the same on Linux and Windows. In general, anything that doesn’t spawn ten thousand threads and use all kinds of OS resources will run at the same speed no matter what the kernel is.

There’s nothing wrong with using Windows for HPC, and there are some very obvious usability advantages (especially if that’s what you’re familiar with). Linux has a clear advantage only if there is a particular application you want to use or code for. (E.g., right now I’m programming a project for Asterisk, which is a Linux telephony server.) Of course there are many such applications, but there is no other, general, magical reason why Linux is better.