How did the CUDA experts get started with CUDA programming?

I am new to CUDA and learning it on my own; most of the ways and techniques of CUDA programming are still unknown to me. I have a very basic idea of how CUDA programs work. For example, the very basic workflow (sketched in code after the list):

  1. Allocating memory on the host (using, say, malloc) and storing data in that host-allocated memory.
  2. Allocating memory on the device (using, say, cudaMalloc from the CUDA runtime API) and copying the data from the host to the device using cudaMemcpy.
  3. Writing a kernel, knowing how SMs, thread blocks, threads, and warps work in CUDA, setting the number of thread blocks and threads per block, and launching the kernel on the default stream using the <<< >>> launch syntax.
  4. Copying the data back from the device to the host using cudaMemcpy.
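
A minimal sketch of what I mean by steps 1-4 (vector addition, with error checking omitted for brevity):

```
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // 1. allocate and fill host memory
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // 2. allocate device memory and copy host -> device
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // 3. launch the kernel on the default stream
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // 4. copy the result device -> host (this cudaMemcpy also synchronizes)
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```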

While I know the basics of CUDA, much of what CUDA makes available is unknown to me. Currently, when I need to get something done, I search for that particular topic and read NVIDIA's CUDA blogs to understand it. (But while doing so, I keep finding terms unknown to me; for example, while trying to learn about CUDA Graphs, I found references to CUDA streams and functions like cudaMallocHost. I get discouraged seeing that I do not know them, and I try to learn them recursively, going depth-first, with a high possibility of getting lost.)

My question might be opinion-based, but how did the current CUDA experts master CUDA programming, so that they have no trouble reading any CUDA blog or anything new?

Do people usually read NVIDIA's CUDA C Programming Guide reference manual completely to get a rough grasp of CUDA (to know what is available in the toolbox named CUDA), or do they read CUDA-related books (like CUDA by Example by Sanders et al.)? Or do they look into the manual as and when required, avoiding reading the entire book or manual?

How do we gain breadth-first knowledge of what is made available by CUDA? And how do we do that in a time-bound manner, given that I have a time constraint?


I do not know whether my question is off-topic or not. If so, please let me know, and I shall delete it.

1 Like

I am by no means a CUDA expert, but I also went through the self-learning process. I highly recommend Robert’s CUDA training series: https://www.olcf.ornl.gov/cuda-training-series/. I have found it very helpful!

I do agree that the CUDA C Programming Guide is too big for new learners; it is more like a reference book, for when you need to confirm some function’s semantics and the like.

And the second suggestion is to try writing some code, through which you will gain a better understanding of the programming techniques. Basically, exercise. Some example applications I have found helpful are (a sketch of one follows the list):

  • vector addition
  • grid reduction
  • softmax (row-wise)
  • matrix-vector multiplication
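
For instance, here is one basic way to write the grid reduction, just as a sketch (many faster variants exist, e.g. warp-shuffle based ones):

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void reduceSum(const float *in, float *out, int n) {
    extern __shared__ float sdata[];  // dynamic shared memory, sized at launch
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction within the block
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(out, sdata[0]);  // combine the per-block partial sums
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    *out = 0.0f;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    reduceSum<<<blocks, threads, threads * sizeof(float)>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("sum = %f (expected %d)\n", *out, n);
    cudaFree(in); cudaFree(out);
    return 0;
}
```
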
1 Like

That is probably impossible at this point. Just like it is pretty much impossible with C++. The ISO-C++ standard alone (just the bare facts, with close to zero explanation) is a massive volume, and it saves hundreds of pages by simply incorporating the ISO-C math library by reference. Nobody (or at least nobody I know) reads the C++ standard front to back like a novel. Just for one (powerful) C++ feature, template metaprogramming, you can work through an 800-page book representing the combined knowledge of three experts in dead-tree format.

I was one of the first CUDA programmers, and CUDA was based on C, not C++, at the time. Fun fact: kernels could only take a single argument! Kernels with multiple arguments had to have those arguments packed into a struct. CUDA language features and CUDA-associated libraries were quite limited, and it was possible to retain a good overview of the entire CUDA universe by growing one’s knowledge organically with the language and its ecosystem.

By 2014, language features were progressing rapidly, mostly through the addition of many modern C++ features; the simple text-based profiler had turned into a sophisticated visual tool, and a new library seemed to pop up every half year. Unless one spent all day just exercising new functionality, it was impossible to maintain a breadth-first understanding.

My personal approach to dealing with “drink from the firehose” scenarios is to pick one problem I actually have to solve (or that I am highly interested in solving), start with whatever bits and pieces of the language environment look recognizable and applicable, get something to work, and then hunt for more information to incrementally expand that towards the eventual solution. These small intermediate successes are psychologically important. Without them, frustration can take hold.

During the hunt for information, I often come across interesting-looking information that does not seem immediately useful and I file that away, as PDFs, or URLs, or hand-written notes. Often, such material has become relevant years, even decades, later.

The way I picture this is: You have to walk before you can run, and crawl before you can walk. Don’t be discouraged by not being able to run from the start. From experience, it is amazing how one can go from nibbling on one corner of a huge mountain to tunneling all the way through it, by keeping on nibbling. Granted, that still leaves most of the mountain unexplored. And that is OK.

Now, repeat the exercise with different kinds of problems, and after some years you may have solid knowledge of a third of the overall language universe, and you may become an expert in 10% or 20% of it in due course. In the ideal case, other people around you (or on the internet) will be experts in different segments, and you can pick up some interesting insights from them for hints of a breadth-first view.

4 Likes

@njuffa If you or Nicholas Wilt published an anecdotal memoir of CUDA, I would be the first one to buy it.

1 Like

I am afraid it was all an intensive but otherwise rather mundane process, and there are no juicy tidbits to report. In any event, I am not a book-writing guy, so if someone were to author such a book, Nick would be a likely candidate, given that he is already a published author with multiple titles under his belt.

The way I joined the CUDA project early on (summer of 2005, I think) is a bit curious.

I had been working at NVIDIA for two years, on handheld products. One day my boss’s boss (a director by title) showed up at my cube and said “Norbert, in recent months I have gotten the impression that you are rather unhappy with your current project. Let’s sit down and discuss where we could find a better match for you at NVIDIA”. I was very surprised, but his impression was astute and on the mark. So we found an empty conference room and he went through a list of other projects that needed staffing. At each item I replied that it did not sound particularly appealing to me.

He finally got to the very last (the sixth, maybe) item on his list and said something to the effect of “I am a bit hesitant to bring this up as this project isn’t even staffed yet and its precise boundaries are still somewhat in flux, but we have initiated a new project for a parallel programming environment based on GPUs”. To which I replied, “Now that sounds interesting. Tell me more!”.

A few days later I met Ian Buck, who had come to NVIDIA from Stanford a few months earlier after finishing his Ph.D. there and was spearheading the development of CUDA. The initial team, led by him, was just half a dozen people (me included), if I recall correctly.

6 Likes

My initial experiments were all based on the CUDA toolkit sample code. Modify, extend, experiment! Then I went on to code some toy projects running on early NVIDIA hardware, such as:

  • rendering iterated function systems (fractal flames), using shared memory for acceleration
  • rendering Mandelbulbs with NVIDIA OptiX (3D fractals based somewhat on the Mandelbrot set)
  • an attempt to port “Evo Lisa” (approximating art with a genetic algorithm) to CUDA

but also interesting maths projects, such as:

  • implementing the Karatsuba multiplication algorithm using shared memory
  • implementing big integer multiplication using the first-generation tensor cores
  • implementing fast primality sieves and primality tests in CUDA

Most of this stuff predates the use of CUDA in my professional work, where I did a lot of small matrix maths for MIMO antenna simulation.

But I think I might be known mostly for my development work on CUDAMiner and CCMiner, which implement the earlier PoW algorithms for cryptocurrencies in CUDA.

I used online tutorials, as well as forums (like this one) to learn the basics, gather inspiration and share knowledge.

I’ve missed out on a lot of the more recent developments, such as the CUDA Graph API. I have not used the cooperative groups API a lot. So I still have a lot to learn.

1 Like

As a person who has been programming in CUDA for about six years, I can say that the first thing you can do is build a simple but robust way of moving data from the host to the device and back. Thrust is one way to do it, but the first program I started working with had its own solution, devised before Thrust was a thing. I have since made an evolved form of that templated class: it allocates twin data arrays on the host and device in different configurations (pinned, unified, decoupled, host- or device-only) and incorporates (in its own way) many of the conveniences of the C++ std::vector<T>. Once you have that, you can start to write C++ functions that do the same things your GPU kernels, operating on the same memory layout, are intended to do. Upload some data, fire off the kernel, download the data, and check the result against your single-threaded CPU program. That’s what I’d recommend.
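
Just as an illustration (a bare-bones sketch, not the actual class described above, which also handles pinned, unified, and the other configurations), such a twin-array wrapper might look like this:

```
#include <cstddef>
#include <stdexcept>
#include <cuda_runtime.h>

// Minimal host/device twin-array: one pageable host buffer, one device buffer.
template <typename T>
class TwinArray {
public:
    explicit TwinArray(size_t n) : n_(n), host_(new T[n]) {
        if (cudaMalloc(&dev_, n * sizeof(T)) != cudaSuccess) {
            delete[] host_;
            throw std::runtime_error("cudaMalloc failed");
        }
    }
    ~TwinArray() { cudaFree(dev_); delete[] host_; }
    size_t size() const { return n_; }
    T *host() { return host_; }
    T *device() { return dev_; }
    void upload()   { cudaMemcpy(dev_, host_, n_ * sizeof(T), cudaMemcpyHostToDevice); }
    void download() { cudaMemcpy(host_, dev_, n_ * sizeof(T), cudaMemcpyDeviceToHost); }
private:
    size_t n_;
    T *host_;
    T *dev_;
};

int main() {
    TwinArray<float> a(1024);
    for (size_t i = 0; i < a.size(); ++i) a.host()[i] = 1.0f;
    a.upload();
    // ... launch a kernel on a.device(), then ...
    a.download();  // compare a.host() against the single-threaded CPU reference here
    return 0;
}
```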

Once you’ve mastered that, you have a few major constraints in the CUDA kernels you write: the available registers per thread, the amount of L1 and the portion of L1 that can be cordoned off as __shared__, the overall size of the global L2 cache, and the speed of the memory bus. Optimizing the performance of your kernels within those limits can be tricky, but this is why there is a place for expertise, not just in terms of CUDA fluency but also in terms of the puzzle-solving work of rearranging an algorithm in a way that is amenable to the massively parallel hardware with its limited cache.
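
Several of those limits can be queried through the CUDA runtime API; a small sketch:

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0
    printf("registers per block:     %d\n", prop.regsPerBlock);
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("L2 cache size:           %d bytes\n", prop.l2CacheSize);
    printf("memory bus width:        %d bits\n", prop.memoryBusWidth);
    return 0;
}
```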

If you are not just new to CUDA but also new to parallel programming, you need to understand the concepts of race conditions, non-associativity in floating-point arithmetic (and the fixed-precision arithmetic which can salvage bitwise reproducibility), and the fact that the CPU keeps going after firing off a CUDA kernel. You can use that last fact to great advantage or get burned by an unexpected race condition.
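
To make that last point concrete, here is a toy sketch: the launch returns immediately, and the host has to synchronize before it can safely read the result (unified memory is used to keep it short):

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void increment(int *x) { atomicAdd(x, 1); }  // atomic avoids a device-side race

int main() {
    int *x;
    cudaMallocManaged(&x, sizeof(int));  // visible to both host and device
    *x = 0;
    increment<<<1, 256>>>(x);  // returns immediately; the GPU runs in the background
    // Reading *x here would race with the still-running kernel.
    cudaDeviceSynchronize();   // wait for the kernel to finish
    printf("%d\n", *x);        // prints 256
    cudaFree(x);
    return 0;
}
```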

There is so much available in CUDA nowadays, from the ability to put print statements in kernels to graphical tools for visualizing your thread occupancy and newer, better versions of the memory checker. The ability to make your algorithm run on a single CPU thread and check the result provides development capabilities along a different axis. The most performant CUDA, in my view, has the GPU executing intricate work units arranged by the CPU again and again. That’s what I’ve been building for the past two and a half years.
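
For example, a device-side print statement is as simple as this toy sketch:

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void hello() {
    printf("hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    hello<<<2, 4>>>();
    cudaDeviceSynchronize();  // also flushes the device-side printf buffer
    return 0;
}
```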

1 Like