Suggestions for CUDA article

I’ve been playing around with the idea of writing a beginners+ article about CUDA.
I’ve been thinking of writing about certain things that many programmers here have problems
with, mainly debugging and understanding why a kernel fails. Some of these things are in the
programming guide, but the article would have a more hands-on emphasis. It would also cover multi-GPU code.
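
As a rough sketch of the kind of hands-on debugging aid meant here: the usual first step in finding out why a kernel fails is to check every runtime call and the launch itself. The macro `CUDA_CHECK`, the kernel `fill` and the helper `fill_and_read_first` are made-up names for illustration, not taken from any article or from the SDK.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: print the error with file/line and abort.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Illustrative kernel: each thread writes one element.
__global__ void fill(int *out, int n, int value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = value;
}

// Launches the kernel with full error checking and returns out[0].
int fill_and_read_first(int n, int value)
{
    int *d_out = nullptr;
    CUDA_CHECK(cudaMalloc(&d_out, n * sizeof(int)));

    fill<<<(n + 255) / 256, 256>>>(d_out, n, value);
    CUDA_CHECK(cudaGetLastError());        // launch errors, e.g. a bad configuration
    CUDA_CHECK(cudaDeviceSynchronize());   // errors raised while the kernel was running

    int h_first = 0;
    CUDA_CHECK(cudaMemcpy(&h_first, d_out, sizeof(int), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_out));
    return h_first;
}

int main()
{
    printf("out[0] = %d\n", fill_and_read_first(1024, 7));
    return 0;
}
```

Kernel launches return nothing themselves, so without the two checks after the launch a failing kernel tends to show up only as garbage results much later.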

I’d be happy to hear what you think and what you’d be interested in…


The first two days I ever played with CUDA, I was in a bad mood because there was no clear step-by-step dumb newbie tutorial for getting “hello world” to run. There are several great newbie projects in the CUDA SDK, the simpleXX series especially, but there wasn’t any real guidance telling me that’s what I should try first, and no step-by-step dummies guide for getting that first project compiled.

Sure, after a few days I was doing OK, and now a year or two later I laugh at the memory, but I do know that it was discouraging that there was no handholding intro, even a one-page “try this first” guide. Even a sticky forum post would have helped.

Eh, that’s a good idea. Let’s see, do I have this lying around… I think I do, let me try to get that posted in the not-too-distant future.

Attached you can find a Word document containing what seems to be my first CUDA article.
The paper attempts to address a few things I stumbled upon when I started programming in CUDA,
plus some answers/code samples addressing what I thought were common misunderstandings
among people starting to write CUDA code, as can be seen in these newsgroups.
I’d appreciate any feedback/ideas/comments/remarks/… :)

eyal (23.3 KB)

There are so many reasons why distributing Word documents is just a really, really bad idea.

Human readable form would be even more appreciated … PDF perhaps, like most other CUDA documents?

edit: :thumbup:…1/cuda-textbook

Do you really feel like the CUDA forums are the best place to express your opinions about which document formats are acceptable for use?

He asked for comments and/or suggestions. I gave him one, which I believe is important, and is exactly the same advice I give my colleagues and students when they write proposals/papers/reports/theses for circulation or distribution. Is there a problem with that?

The idea behind the paper was to assist CUDA beginners. I was hoping for more technical feedback, or even a “that is helpful/the article is not good/I’d like to see info on…”, something along those lines. The .doc file was the fastest.

Anyway thanks for the advice.


All formatting got lost when I opened the thing in OpenOffice, but so what :)

From the look and feel of it, this is a nice article. Publish it.
Constructive criticism:

  • Point out more distinctly that newbies are not your target audience.
  • DO NOT USE SDK FUNCTIONS, make it more self-contained
  • emphasise again and again and again that all speedups are completely irrelevant if not measured against an optimised CPU code.

Sorry, I don’t have time currently to review this further. But: We are looking for stuff like that on If you consider a place to publish your article, let me know and we’ll work something out…


Hi Dominik,

I tried to PM you, but the forum didn’t let me. Maybe because of a character set issue…

It would be a great pleasure to publish for What did you have in mind? how can we proceed from here?

  • The .doc file was the fastest, and as you can see in the forums, I got bashed over it :)
  • Actually I meant the article to be for newbies+. Why do you think it’s not for newbies?
  • The SDK functions were put there to emphasise bad practice. In the final example I use CUDA events instead of timers (I’m aware of tmurray’s wrath about this ;) ). Did I miss something there?
  • The remark about speedup measurements is correct. I should probably add more about how to measure speedups and address some of the misconceptions seen in the newsgroups regarding this issue.
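
For reference, a minimal sketch of timing a kernel with CUDA events rather than SDK timers. The kernel `scale` and the helper `time_scale_ms` are invented for illustration and are not the code from the article.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: multiply every element by a factor.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Returns the GPU time of one launch of `scale` in milliseconds.
float time_scale_ms(float *d_data, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);   // wait until the kernel and the stop event are done

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 1 << 20;
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    time_scale_ms(d_data, n);              // warm-up launch, result ignored
    float ms = time_scale_ms(d_data, n);   // measured launch
    printf("scale kernel: %.3f ms for %d elements\n", ms, n);

    cudaFree(d_data);
    return 0;
}
```

This measures only the kernel. A speedup figure means something only when the CPU side gets the same care (an optimised build), and it usually helps to state whether host-device transfers are included.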

many thanks and looking forward to hearing from you.

Eyal Hirsch

My bunch of suggestions and comments:

  • If a beginner doesn’t want to read the Programming Guide, don’t force it upon him. After all, he picked up your shorter article so as not to read the long Programming Guide (at least at first). I believe reading some simple examples is actually better than jumping into the PG immediately.
  • I wouldn’t use some unknown myKernel function. Put something there that the reader can copy-paste, compile and run. I think a kernel which computes the sum of all elements may be a good choice. The first version would simply do it in sequential form <<<1,1>>>, while a more advanced version would perform a reduction <<<1,X>>>. This code will still be easier to understand than the matrix multiplication from the PG appendix. Besides, prefix sum is used so often, and beginners usually think of atomic instructions at this point.
  • Maybe add some sections so that the reader knows where he can take a break.
  • Understanding cudaMemcpy is crucial, but synchronization can be talked about somewhere later. Don’t throw all those details at a newbie reader!
  • Use different fonts for code and its comments (e.g. italics for comments and monospace for code).
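
To make the sum-kernel suggestion concrete, here is one possible copy-paste-and-run sketch of the two versions, <<<1,1>>> and <<<1,X>>>. All names (`sum_sequential`, `sum_reduce`, `gpu_sum`, `THREADS`) are made up for this example, and X is assumed to be a power of two.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define THREADS 256   // X in <<<1,X>>>; must be a power of two for the loop below

// Version 1: a single thread adds everything up. Launch as <<<1,1>>>.
__global__ void sum_sequential(const int *in, int n, int *out)
{
    int total = 0;
    for (int i = 0; i < n; ++i) total += in[i];
    *out = total;
}

// Version 2: one block of THREADS threads. Launch as <<<1,THREADS>>>.
__global__ void sum_reduce(const int *in, int n, int *out)
{
    __shared__ int partial[THREADS];
    int tid = threadIdx.x;

    // Each thread first sums its own strided slice of the input.
    int local = 0;
    for (int i = tid; i < n; i += blockDim.x) local += in[i];
    partial[tid] = local;
    __syncthreads();

    // Tree reduction in shared memory: half as many active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();
    }
    if (tid == 0) *out = partial[0];
}

// Host helper: copies the input to the GPU, runs one version, returns the sum.
int gpu_sum(const std::vector<int> &h_in, bool parallel)
{
    int n = static_cast<int>(h_in.size());
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    if (parallel) sum_reduce<<<1, THREADS>>>(d_in, n, d_out);
    else          sum_sequential<<<1, 1>>>(d_in, n, d_out);

    int h_out = 0;
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}

int main()
{
    std::vector<int> data(1000, 1);   // 1000 ones, so the sum should be 1000
    printf("sequential: %d\n", gpu_sum(data, false));
    printf("reduction:  %d\n", gpu_sum(data, true));
    return 0;
}
```

Error checking is left out to keep the sketch short; a real tutorial version would check every call.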

First, thanks a lot for the great input :)

Yes, you’re right regarding the PG. If people took the time to read it, I guess at least 10% of the questions in the forums wouldn’t have been asked. This was the main motivation behind the article, plus the fact that I think the PG lacks that sort of quick hands-on introduction for people who are looking for something lighter than the PG to get them started.

Amazing… that was exactly my idea for a future paper (should this one be read and appreciated by people). I myself looked at this in the beginning and asked the forums about reduction. There are indeed a lot of people asking about this, and I totally agree with you that it is much simpler to understand than the matrix mul sample.

I thought of showing a faulty implementation of the reduction in order to better explain how easy it is to get race conditions in CUDA.

I’ve even seen a few (not reduction-related) posts here where people wrote for loops per thread instead of breaking the loop up across different threads. I thought a simple explanation using reduction/sum of why this doesn’t work would be the easiest to understand… amazing :)
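
A sketch of what such a faulty sum might look like, next to one possible fix. The kernels `sum_racy` and `sum_atomic` and the helper `run_sum` are hypothetical names; the atomic version is just the simplest repair, not necessarily the fastest.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Faulty: every thread does an unsynchronized read-modify-write on the same address.
__global__ void sum_racy(const int *in, int n, int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) *out += in[i];   // data race: updates from other threads can be lost
}

// One possible fix: make the read-modify-write atomic.
__global__ void sum_atomic(const int *in, int n, int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(out, in[i]);
}

// Host helper: sums n ones on the GPU with either kernel and returns the result.
int run_sum(bool use_atomic, int n)
{
    std::vector<int> h_in(n, 1);
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(int));   // the accumulator must start at zero

    int blocks = (n + 255) / 256;
    if (use_atomic) sum_atomic<<<blocks, 256>>>(d_in, n, d_out);
    else            sum_racy<<<blocks, 256>>>(d_in, n, d_out);

    int h_out = 0;
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}

int main()
{
    const int n = 1 << 16;
    printf("expected: %d\n", n);
    printf("racy:     %d (not reliable, may differ between runs)\n", run_sum(false, n));
    printf("atomic:   %d\n", run_sum(true, n));
    return 0;
}
```

The racy kernel compiles and launches without any complaint, which is exactly what makes this kind of bug confusing for beginners.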

I didn’t understand this one…

That’s true, maybe I should somehow break it up and add more misuse cases that can be seen in the forums (such as trying to allocate memory from inside the kernel).

again many thanks for the great input…



I simply suggest adding some sort of short chapters to your guide.

One way of reading a guide is: read part of it, play with it/implement it, check how it works, read the next part, play with it, and so on…

Having the guide divided into chapters helps the reader decide when to stop reading and start coding something.

Also, I would suggest limiting the number of false examples (code that does not work), at least at the beginning. Give the reader something that works first, so that he can get a taste of what can be done.

It could be something stupid like 2+2 computed on GPU :) [sort of ‘hello world’].
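
Something along these lines, as a minimal sketch (the kernel `add` and the helper `gpu_add` are made-up names):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One thread adds two numbers and stores the result in GPU memory.
__global__ void add(int a, int b, int *result)
{
    *result = a + b;
}

// Host helper: allocate GPU memory, run the kernel, copy the answer back.
int gpu_add(int a, int b)
{
    int *d_result = nullptr;
    cudaMalloc(&d_result, sizeof(int));

    add<<<1, 1>>>(a, b, d_result);   // one block with one thread

    int h_result = 0;
    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_result);
    return h_result;
}

int main()
{
    printf("2 + 2 = %d\n", gpu_add(2, 2));
    return 0;
}
```

Saved under a name such as hello.cu (the file name is arbitrary), this should build with `nvcc hello.cu -o hello`. It already shows the three steps every CUDA program has: allocate on the device, launch a kernel, copy the result back.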

Regarding reduction, this seems to be a good order:

  • working single-threaded code

  • multithreaded-code with race conditions

  • correct multithreaded-code

Also, once you are done with this text, I would suggest a slightly more advanced course on bank conflicts. Reading just the PG on this makes it look very, very complicated!

The “minimal” CUDA code walkthrough that Mark and I came up with for is axpy.…lease_id=673202

I believe the code is self-contained and covers all the basics. The “beauty” of it is that you can’t do anything wrong for this kind of “map” operation in CUDA: We did this because this operation is used in the prominent CUDA articles in ACM queue. This matches PDan’s suggestion to start with a simple working code and to discuss common errors later.
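
For readers who have not seen it, an axpy "map" kernel looks roughly like the sketch below. This is not the code from the walkthrough linked above, only an illustration; the names `axpy` and `gpu_axpy` and the block size of 256 are assumptions.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// y[i] = a * x[i] + y[i], one element per thread.
__global__ void axpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];   // no thread touches another thread's element
}

// Host helper: runs axpy on the GPU and returns the updated y.
std::vector<float> gpu_axpy(float a, const std::vector<float> &x, std::vector<float> y)
{
    int n = static_cast<int>(x.size());
    size_t bytes = n * sizeof(float);
    float *d_x = nullptr, *d_y = nullptr;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, x.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y.data(), bytes, cudaMemcpyHostToDevice);

    axpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);

    cudaMemcpy(y.data(), d_y, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
    return y;
}

int main()
{
    std::vector<float> x(1 << 20, 1.0f), y(1 << 20, 2.0f);
    std::vector<float> result = gpu_axpy(3.0f, x, y);
    printf("result[0] = %.1f (3 * 1 + 2)\n", result[0]);
    return 0;
}
```

Because each output element depends only on the inputs at the same index, there is nothing to synchronize and no race to get wrong.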

Funny to see that reductions are apparently a nice didactic approach. We’ll be doing pretty much exactly the same thing in upcoming CUDA hands-on sessions: Sequential summing within one block for starters and working from there…

let’s peer-review here. eyal, please email updates (dominik.goeddeke AT Once the document is settled, I’ll see to posting and eventually integrating it at Can you develop the tutorial in HTML? That’d be cool… Also, please do not try to reinvent the wheel over Wen-Mei’s CUDA book.