Where is the basic help for CUDA?

Okay so I have finished a successful install of CUDA 2.3 and I am ready to go into it and start getting something simple to work.
I have looked at the SDK examples and tried to make some basic stuff of my own, just adding numbers etc.
But I must admit that I am disappointed at the amount of tutorial stuff available for CUDA coding. Everything is about how great it is, and how much you can improve your work efficiency etc.
But where is the basic stuff?
Where is the basic “hello world” example?
Where is the example where you get shown once how to make a simple vector or matrix and do some simple math operations on them?
Where are the examples if you want to do simple loop structures?
In general, all of those things that every other programming language shows by example in its intro material.
If CUDA is meant to provide easy access to supercomputing to the masses, would it perhaps not be prudent to at least provide those basic examples along with the intro material so more people can use it straight out of the box?

If anyone knows about such material, either online or in book form, please let me know, since I would rather get on with having some actual code structure problems to look at rather than syntax problems.

The SDK has an awful lot of examples in it.

Also, have you read the Programming Guide? Section 3.2 goes over a matrix multiply in quite some detail.

CUDA does not allow I/O from the GPU, so forget about “hello world”.
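That said, a "hello world" of sorts is still possible if the kernel only fills a buffer and the host does the printing. What follows is a minimal sketch under that assumption, not an official sample: the names `shout` and `shout_on_gpu` are made up for illustration, and error handling is kept to a bare minimum.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Each thread upper-cases one character of the message in device memory.
__global__ void shout(char *text, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && text[i] >= 'a' && text[i] <= 'z')
        text[i] = text[i] - 'a' + 'A';
}

// Hypothetical helper: round-trips `text` (n characters, not counting the
// terminating '\0') through the GPU. Returns 0 on success, 1 on a CUDA error.
int shout_on_gpu(char *text, int n)
{
    char *d_text = NULL;
    if (cudaMalloc((void **)&d_text, n) != cudaSuccess) return 1;
    cudaMemcpy(d_text, text, n, cudaMemcpyHostToDevice);
    shout<<<(n + 63) / 64, 64>>>(d_text, n);      // 64 threads per block
    cudaError_t err = cudaMemcpy(text, d_text, n, cudaMemcpyDeviceToHost);
    cudaFree(d_text);
    return err == cudaSuccess ? 0 : 1;
}

int main(void)
{
    char text[] = "hello, world";
    if (shout_on_gpu(text, (int)strlen(text)) != 0) {
        printf("CUDA error\n");
        return 1;
    }
    printf("%s\n", text);   // the printing itself happens on the host
    return 0;
}
```

Saved as a `.cu` file and built with `nvcc`, this should print the message in capitals on a machine with a working CUDA device.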

CUDA Programming Guide, chapters 2 and 3.

Loops often disappear in CUDA, because the loop iterations are distributed among threads. That is part of the programming model, which you MUST know to develop applications in CUDA; see Section 3.2. When loops do not disappear, they are just plain C for/while loops.
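To make the "disappearing loop" concrete, here is a hedged sketch of the usual vector addition, once as a serial C loop and once as a kernel where each thread takes one iteration. The function names and the driver `compare_add` are illustrative only and do not come from the SDK or the Programming Guide.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Serial C version: one loop, one iteration per element.
void add_cpu(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

// CUDA version: the loop is gone. Each thread computes the index of
// "its" iteration from its block and thread IDs.
__global__ void add_gpu(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // the last block may have spare threads
        c[i] = a[i] + b[i];
}

// Hypothetical driver: runs both versions on n elements and returns the
// number of elements where they disagree, or -1 on a CUDA error.
int compare_add(int n)
{
    size_t bytes = n * sizeof(float);
    float *a = (float *)malloc(bytes), *b = (float *)malloc(bytes);
    float *c_cpu = (float *)malloc(bytes), *c_gpu = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }
    add_cpu(a, b, c_cpu, n);

    float *d_a = NULL, *d_b = NULL, *d_c = NULL;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;   // enough blocks to cover n
    add_gpu<<<blocks, threads>>>(d_a, d_b, d_c, n);
    cudaError_t err = cudaMemcpy(c_gpu, d_c, bytes, cudaMemcpyDeviceToHost);

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (c_gpu[i] != c_cpu[i]) ++bad;

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(a); free(b); free(c_cpu); free(c_gpu);
    return err == cudaSuccess ? bad : -1;
}

int main(void)
{
    printf("mismatches: %d\n", compare_add(1000));
    return 0;
}
```

The `if (i < n)` guard is there because the grid is rounded up to whole blocks, so some threads in the last block have no element to work on.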

CUDA is an easy front end to GPU power. Ask the guys who tried using OpenGL or DirectX, or the Cg shading language… It is a C extension. Knowing C is a prerequisite: “C for CUDA provides a simple path for users familiar with the C programming language to easily write programs for execution by the device.” If you are not familiar with C, just learn C first.

Sure, C+MPI is easier: it is ANSI C + an API. But after you have written your application you will need your own personal datacenter to run it. The NVIDIA slogan refers to the computing power of current GPUs (on the order of 1 TFLOPS) compared with current CPUs (on the order of 10 GFLOPS). SIMD/SIMT has more constraints than MIMD, so it is more difficult. At the very least it is less familiar. But I do not think anybody has ever claimed otherwise.

I think the CUDA forum will be glad to give you all the help you need, especially if you start asking real questions instead of blaming a documentation that IMO is excellent, as are the examples in the SDK. The architecture is different, so you need to study it. The programming guide is your entry point. Study it carefully and you will find most of what you need.

Thanks for an excellent fast response sigismondo!

I guess what I am searching for is just some more down to earth examples to start out with, rather than starting by looking at “simulations of the Earth’s crust”. I am simply wondering why, after CUDA has been on the market for years now, no simple cookbook seems to be around.

But enough of that, on to a more constructive question. Does anyone know if there is a ready-made library with basic math functions and operations that can be called from CUDA?

Yeah, you have got the point. Every software developer's dream is a CUDA-enabled compiler that understands spoken natural language, just as it is an auto-parallelizing compiler for standard clusters… it seems we still need our brains for these tasks :haha:

CUBLAS (linear algebra) and CUFFT (FFT!) should already be in what you have downloaded (they are there in the Windows download).
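As a rough idea of what calling the BLAS library looks like, here is a hedged sketch of a single-precision `y = alpha * x + y` (SAXPY). Note the assumption: it uses the handle-based interface from `cublas_v2.h`, which ships with later toolkits than 2.3; the 2.3-era interface uses `cublasInit()`/`cublasShutdown()` and takes no handle. The wrapper name `saxpy_cublas` is made up for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>   // assumption: handle-based API from later toolkits

// Hypothetical wrapper: computes y = alpha * x + y on the GPU.
// Returns 0 on success, 1 on any CUDA or cuBLAS error.
int saxpy_cublas(int n, float alpha, const float *x, float *y)
{
    float *d_x = NULL, *d_y = NULL;
    cublasHandle_t handle;
    int rc = 1;

    if (cudaMalloc((void **)&d_x, n * sizeof(float)) != cudaSuccess)
        return 1;
    if (cudaMalloc((void **)&d_y, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_x);
        return 1;
    }

    if (cublasCreate(&handle) == CUBLAS_STATUS_SUCCESS) {
        // Copy both vectors to the device (stride 1 on both sides).
        cublasSetVector(n, sizeof(float), x, 1, d_x, 1);
        cublasSetVector(n, sizeof(float), y, 1, d_y, 1);

        // The library call does the math; the result is copied back into y.
        if (cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1) == CUBLAS_STATUS_SUCCESS &&
            cublasGetVector(n, sizeof(float), d_y, 1, y, 1) == CUBLAS_STATUS_SUCCESS)
            rc = 0;
        cublasDestroy(handle);
    }

    cudaFree(d_x);
    cudaFree(d_y);
    return rc;
}

int main(void)
{
    float x[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float y[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    if (saxpy_cublas(4, 2.0f, x, y) != 0) {
        printf("cuBLAS error\n");
        return 1;
    }
    printf("%g %g %g %g\n", y[0], y[1], y[2], y[3]);
    return 0;
}
```

The program has to be linked against the library, for example with `nvcc saxpy.cu -lcublas` (the file name is arbitrary).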

Take a look at the Dr. Dobb's articles:


Yes, I found them, thanks. Guess I will just find ye olde dusty C book once more and brush up on the basics; I have been doing nothing but Fortran and Python coding for years now.
I have got PyCUDA, but even there you can't get away from the C structures, though at least you do not have to do as much memory allocation and cleanup. Oh well, guess a few more grey hairs over syntax learning will not hurt. :haha:

PGI is going to release a Fortran for CUDA compiler very soon.
The basic programming model is the same as C for CUDA (threads and blocks), but you can use the Fortran syntax.

Thanks mfatica, I am looking over Dr. Dobb’s articles now with ye olde C tutorial. Must admit that his examples are a lot nicer to start out with than some of the other stuff that I have found. While it is nice to see fancy simulations and what you can do with CUDA, it is still a bit better to start out slow. I think that I will just find some basic C examples instead and try to update them with CUDA additions as soon as I get the basic syntax and programming model nailed down.

What you say about Fortran also sounds interesting, since then I can benchmark the Fortran version of my program as well as the Python and soon C versions, all with their respective CUDA additions!

Aiyen, I feel your pain! Why doesn’t nVidia either publish, or pay someone to write and publish, an “Introduction to CUDA” or “CUDA for Dummies” or some equivalent book? The idea that bullet points from a seminar/tutorial are a suitable alternative is incomprehensible. Furthermore, the idea that we live in a paperless society, therefore books are an old idea, is itself getting old. If the nVidia site were better documented and searchable then it wouldn’t be a problem. But the site doesn’t have those properties.

nVidia could help themselves by listening to and reading your/my thoughts.


I have found that my code often follows this pattern

  1. all threads work as a team to copy some data from global to shared arrays, each thread responsible for copying 1 value

  2. threads do their stuff on their bit of the shared array, e.g.

  3. threads work as a team again to save results from shared to global

Ensuring that 1) and 3) take advantage of the coalesced reads/writes of the hardware is important for performance.

(See the examples in the CUDA Programming guide, Performance guidelines)
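The three steps above can be sketched roughly as follows. This is only an illustration under some assumptions: the "stuff" done in step 2 is an arbitrary choice (reversing each block's 256-element tile), the block size is fixed at 256, and the names `reverse_tiles` and `check_reverse_tiles` are invented for the example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define TILE 256   // threads per block, and elements per shared tile

__global__ void reverse_tiles(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];
    int t = threadIdx.x;
    int base = blockIdx.x * TILE;
    int count = min(TILE, n - base);   // the last tile may be short

    // 1. Team copy, global -> shared: one value per thread, and
    //    neighbouring threads read neighbouring addresses.
    if (t < count) tile[t] = in[base + t];
    __syncthreads();                   // the whole tile must be loaded first

    // 2. Threads do their stuff on their bit of the shared array:
    //    here, the first half swaps each element with its mirror image.
    if (t < count / 2) {
        float tmp = tile[t];
        tile[t] = tile[count - 1 - t];
        tile[count - 1 - t] = tmp;
    }
    __syncthreads();                   // all swaps done before writing out

    // 3. Team copy, shared -> global, again one value per thread.
    if (t < count) out[base + t] = tile[t];
}

// Hypothetical check: returns the number of wrong elements, or -1 on error.
int check_reverse_tiles(int n)
{
    size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d_in = NULL, *d_out = NULL;
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, h, bytes, cudaMemcpyHostToDevice);
    reverse_tiles<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n);
    cudaError_t err = cudaMemcpy(h, d_out, bytes, cudaMemcpyDeviceToHost);

    int bad = 0;
    for (int i = 0; i < n; ++i) {
        int base = (i / TILE) * TILE;
        int count = (n - base < TILE) ? n - base : TILE;
        if (h[i] != (float)(base + count - 1 - (i - base))) ++bad;
    }
    cudaFree(d_in); cudaFree(d_out); free(h);
    return err == cudaSuccess ? bad : -1;
}

int main(void)
{
    printf("wrong elements: %d\n", check_reverse_tiles(600));
    return 0;
}
```

The two `__syncthreads()` calls are what make the "team" phases safe: nobody touches the tile until everyone has loaded, and nobody writes out until all swaps are finished.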

To take advantage of that coalesced read/write, there appear to be two basic templates that are useful. I call them the Box and the Broom.

The Broom is thin and wide (e.g. 1 row and 256 columns) and very useful for, say, calculating the total of each column in a 10000x10000 array. In this case there would be 10000/256 blocks (rounded up, so 40), and each block would calculate totals for 256 columns (each thread does 1 column) by reading its 256 columns in the 1st row, then moving to the 2nd row, …
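A hedged sketch of the Broom for the column-total case might look like this. It assumes a row-major array, keeps each thread's running total in a register rather than staging through shared memory, and uses a much smaller array than 10000x10000 for the check; `column_totals` and `check_column_totals` are illustrative names.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// One thread per column. On every row the 256 threads of a block read
// 256 adjacent values (row-major layout), then the whole "broom" moves
// down to the next row.
__global__ void column_totals(const float *a, float *totals, int rows, int cols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= cols) return;           // spare threads in the last block

    float sum = 0.0f;
    for (int row = 0; row < rows; ++row)
        sum += a[(size_t)row * cols + col];
    totals[col] = sum;
}

// Hypothetical check: returns the number of wrong totals, or -1 on error.
int check_column_totals(int rows, int cols)
{
    size_t n = (size_t)rows * cols;
    float *h_a = (float *)malloc(n * sizeof(float));
    float *h_t = (float *)malloc(cols * sizeof(float));
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            h_a[(size_t)r * cols + c] = (float)(c % 7);

    float *d_a = NULL, *d_t = NULL;
    cudaMalloc((void **)&d_a, n * sizeof(float));
    cudaMalloc((void **)&d_t, cols * sizeof(float));
    cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;                           // width of the broom
    int blocks = (cols + threads - 1) / threads; // rounded up
    column_totals<<<blocks, threads>>>(d_a, d_t, rows, cols);
    cudaError_t err = cudaMemcpy(h_t, d_t, cols * sizeof(float),
                                 cudaMemcpyDeviceToHost);

    int bad = 0;
    for (int c = 0; c < cols; ++c)
        if (h_t[c] != (float)(rows * (c % 7))) ++bad;

    cudaFree(d_a); cudaFree(d_t); free(h_a); free(h_t);
    return err == cudaSuccess ? bad : -1;
}

int main(void)
{
    printf("wrong totals: %d\n", check_column_totals(100, 1000));
    return 0;
}
```

The loop over rows is one of the plain C loops that does not disappear: the columns are spread across threads, while each thread still walks down its own column.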

For cases where you want to compare the value of one cell with its neighbours, the Box is better. It is usually a 16x16 array, i.e. the first 16 threads copy 16 adjacent values from global to shared memory, the next 16 threads do the next row, …

Boxes can move efficiently down rows or across columns too.
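One possible shape for the Box, as a sketch only: a 16x16 block loads a 16x16 tile into shared memory, and the inner 14x14 threads compare their cell with its four neighbours, so neighbouring boxes overlap by one cell on each side. The task chosen here (flagging cells strictly larger than their up/down/left/right neighbours) and all names are assumptions made for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define BOX 16             // the box is BOX x BOX threads
#define INNER (BOX - 2)    // cells per box that have all four neighbours

__global__ void local_maxima(const float *in, int *is_max, int rows, int cols)
{
    __shared__ float box[BOX][BOX];
    int tx = threadIdx.x, ty = threadIdx.y;
    // Boxes overlap by one cell on every side, so they advance by INNER.
    int col = blockIdx.x * INNER + tx - 1;
    int row = blockIdx.y * INNER + ty - 1;
    bool inside = row >= 0 && row < rows && col >= 0 && col < cols;

    // Each row of 16 threads copies 16 adjacent values into the box.
    box[ty][tx] = inside ? in[(size_t)row * cols + col] : 0.0f;
    __syncthreads();

    // Only inner threads whose cell is not on the array border take part.
    bool inner = tx >= 1 && tx <= INNER && ty >= 1 && ty <= INNER;
    if (inner && row >= 1 && row < rows - 1 && col >= 1 && col < cols - 1) {
        float v = box[ty][tx];
        bool peak = v > box[ty - 1][tx] && v > box[ty + 1][tx] &&
                    v > box[ty][tx - 1] && v > box[ty][tx + 1];
        is_max[(size_t)row * cols + col] = peak ? 1 : 0;
    }
}

// Hypothetical demo: a 40x40 array of zeros with five spikes, one of them
// on the border (never flagged). Returns the flagged count, or -1 on error.
int count_local_maxima_demo(void)
{
    const int rows = 40, cols = 40, n = rows * cols;
    float *h_in = (float *)calloc(n, sizeof(float));
    int *h_out = (int *)malloc(n * sizeof(int));
    h_in[5 * cols + 5] = h_in[13 * cols + 14] = 1.0f;
    h_in[20 * cols + 27] = h_in[38 * cols + 38] = 1.0f;
    h_in[0 * cols + 3] = 1.0f;                    // border cell

    float *d_in = NULL; int *d_out = NULL;
    cudaMalloc((void **)&d_in, n * sizeof(float));
    cudaMalloc((void **)&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, n * sizeof(int));        // border cells stay 0

    dim3 block(BOX, BOX);
    dim3 grid((cols + INNER - 1) / INNER, (rows + INNER - 1) / INNER);
    local_maxima<<<grid, block>>>(d_in, d_out, rows, cols);
    cudaError_t err = cudaMemcpy(h_out, d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    int found = 0;
    for (int i = 0; i < n; ++i) found += h_out[i];
    cudaFree(d_in); cudaFree(d_out); free(h_in); free(h_out);
    return err == cudaSuccess ? found : -1;
}

int main(void)
{
    printf("local maxima found: %d\n", count_local_maxima_demo());
    return 0;
}
```

Because the load is shifted by one cell to make room for the halo, the 16-value rows are not aligned to 16-element boundaries; whether that costs anything depends on the hardware's coalescing rules, which this sketch does not try to optimise for.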

Both of these have lots of uses other than those examples, and lots of variants. For example, a broom that reads 3 rows of 130 columns so that the central 128 cells have neighbours on all sides.

Hope that helps.

To add to your and MMB’s suggestions I think a collection of snippets of code that people might frequently need would be useful, just to help people get started.


First off, to MMB… I would not go as far as saying that Nvidia should publish anything or pay someone to do anything. They should just focus on providing and developing CUDA so it can be used effectively. My point was more that I am amazed that there is not more community-made material to help new users get started, since it's been years since CUDA was released.
While I must admit I love this forum and how fast people seem to respond, I guess I had just expected more material to be available for newcomers.
But as mfatica suggested, the Dr. Dobb's web page is a fairly decent place to start. I have learned a lot by just sitting and playing around with his first 2 examples.

On that note, thanks, Kbam, for your response; it sheds some light on what I am struggling with now. I am currently working on understanding the thread block aspect. I still do not understand all of it, but I like learning so it's all good!


I am plowing through everything myself. I have to make alphabetical textfiles about the functions, some of which are documented in the Reference Manual, and SOME OF WHICH ARE NOT. I am flailing about for HOURS because some aspect of Linux / Unix / C++ / CUDA isn’t explained. I have listened to the Univ of Ill classes repeatedly and studied their slides. I have played with code samples galore.

For example: they SAY CUDA is an extension of C. OK, well, in the code samples in the Programming Guide, the operator -> appears. My C books describe this operator, all right, but clearly it’s a different thing. Is it a C++ overloaded operator? WTF why can’t somebody just say so!?
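Without knowing which sample is meant, one thing can at least be checked: on a plain pointer to a struct, `->` in a `.cu` file is the ordinary C operator, where `p->x` is shorthand for `(*p).x` (C++ additionally allows class types to overload it). The sketch below assumes that plain-pointer case; the `Point` struct and the function names are invented for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

struct Point { float x; float y; };

// Host code: the two functions are equivalent, p->x means (*p).x.
float sum_arrow(const Point *p) { return p->x + p->y; }
float sum_deref(const Point *p) { return (*p).x + (*p).y; }

// Device code uses the operator in exactly the same way.
__global__ void scale(Point *p, float s)
{
    p->x *= s;
    p->y *= s;
}

// Hypothetical helper: scales *p on the GPU. Returns 0 on success.
int scale_on_gpu(Point *p, float s)
{
    Point *d_p = NULL;
    if (cudaMalloc((void **)&d_p, sizeof(Point)) != cudaSuccess) return 1;
    cudaMemcpy(d_p, p, sizeof(Point), cudaMemcpyHostToDevice);
    scale<<<1, 1>>>(d_p, s);           // a single thread is enough here
    cudaError_t err = cudaMemcpy(p, d_p, sizeof(Point), cudaMemcpyDeviceToHost);
    cudaFree(d_p);
    return err == cudaSuccess ? 0 : 1;
}

int main(void)
{
    Point pt = {1.5f, 2.5f};
    printf("%g %g\n", sum_arrow(&pt), sum_deref(&pt));   // same value twice
    if (scale_on_gpu(&pt, 2.0f) == 0)
        printf("%g %g\n", pt.x, pt.y);
    return 0;
}
```

If the sample in question applies `->` to something that is not a raw pointer, then it is relying on C++ and this sketch does not cover it.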

I have a growing list of undefined, undocumented functions (I have to learn OpenGL plus CUDA). Maybe I’ll find them in the OpenGL documentation, maybe I’ll have to Google them.

Don’t think for a moment that Google is the savior! Sometimes you can spend an hour reading stuff that LOOKS relevant but turns out not to be.

That list I speak of comes from the code samples in the OpenGL interoperability section of the Programming Guide. I want to UNDERSTAND those code samples, but you CAN’T understand undocumented code! You can flail, and try, and kluge, but somewhere, in a manual that purports to be a starter’s document, EVERY WORD should have a definition.


Maybe not eyeless in Gaza, but sure as hell clueless in Potsdam.