For my senior thesis, I have written numerous computer simulations of various kinds, such as cellular automata and kinetic growth models, along with ODE and PDE solvers (reaction-diffusion systems, molecular dynamics simulations). But I'll admit that I'm a bit lost when it comes to all this talk of threads, blocks, and memory access. However, I just got a computer with a CUDA-enabled chip (GTX 260M) and I would like to leverage its power for my simulations.
All the documentation I've found so far feels like it was written for people whose whole life is programming and computer hardware and software. I am a guy who can write "semi-complex" C programs and do a whole bunch of calculus, etc., but I don't know how to take my small, simple programs that solve equations of motion for N bodies or update cells on a 2D lattice automaton and make them faster with the capabilities of a CUDA chip.
Can anyone recommend any first steps or documents to read? A lot of the CUDA sample code I’m reading seems indecipherable… I’m excited about CUDA, but I feel like I’ve stepped into a land I know nothing about :wacko:
My problems are really of only four kinds:
Cellular automata, usually on a rectangular lattice
Numerical integration (usually multidimensional)
N-body ODEs
Solving PDEs on a simple rectangular lattice
Any suggestions for a simple "get started writing simple parallel programs" guide for someone who isn't a real "computer programmer" (just a physicist with some minor C skills)? Is it really complicated to take a serial program that does, say, N-body ODE integration and turn it into a parallel program?
Don't worry too much about this. Everyone, including those with more programming experience, was certainly confused at first as well. The CUDA programming model and terminology are unfamiliar to most people, even (or perhaps especially) those with previous parallel programming experience.
I would start by reading the CUDA Programming Guide (it's included in the doc directory of the CUDA toolkit). Print it out, put it under your pillow. Read through chapters 1-5 (chapter 3 is a little dense, so you can skim that the first time through). Then go try to write the "Hello, world" equivalent in CUDA, which can be something like: load two 1024-element vectors into GPU memory, write the sum of the two vectors to a third vector, and then read the sum back to host memory. Once you can do that, you're 50% of the way there. :)
(You might want to go back and read the Programming Guide again at this point. More of it will make sense.)
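For concreteness, here is a minimal sketch of that vector-sum exercise. Error checking is omitted for brevity, and the kernel name `add` and the launch configuration (4 blocks of 256 threads) are just illustrative choices:

```
// vector_add.cu -- the "Hello, world" of CUDA:
// compute c = a + b for two 1024-element vectors on the GPU.
#include <stdio.h>

__global__ void add(const float *a, const float *b, float *c)
{
    // Each thread handles exactly one element.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}

int main(void)
{
    const int N = 1024;
    float a[N], b[N], c[N];
    for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 2.0f * i; }

    // Allocate device memory and copy the inputs over.
    float *da, *db, *dc;
    cudaMalloc(&da, N * sizeof(float));
    cudaMalloc(&db, N * sizeof(float));
    cudaMalloc(&dc, N * sizeof(float));
    cudaMemcpy(da, a, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, N * sizeof(float), cudaMemcpyHostToDevice);

    // 4 blocks of 256 threads cover all 1024 elements exactly.
    add<<<4, 256>>>(da, db, dc);

    // Read the sum back to host memory and spot-check one element.
    cudaMemcpy(c, dc, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("c[10] = %f (expected 30)\n", c[10]);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}
```

Compile it with `nvcc vector_add.cu`, run it, and then experiment with different grid and block sizes.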
There are also some excellent class lectures on CUDA from UIUC.
The answer to this question is "it depends." There are many different parallel programming models (which is why people with previous parallel programming experience in a different model are often more frustrated with CUDA at first). CUDA is aimed specifically at data parallelism. The essence of data-parallel programming is "do the same thing to many data elements at once." Not all tasks map to the data-parallel model, so don't try to force it. That said, there are often clever ways to recast a problem to be more amenable to this programming style, so it doesn't hurt to Google for [my programming task] + "CUDA" to see if someone else has solved your problem before.
By lowering the overhead of running more threads to nearly zero, CUDA encourages you to map one thread to one data element. This is different from other parallel programming models, which tend to encourage you to map one thread to one task, or one thread to one CPU core. To maximize usage of the GPU resources, you want to be running thousands of threads, even though your GPU only has 112 stream processors.
To take cellular automata as an example: a simple implementation in CUDA assigns each lattice site to a thread. The kernel, therefore, only contains the code required to take the block and thread indices (the pair is unique for each thread), convert them to a lattice location, read the nearest neighbors from the lattice array, compute the new state of the lattice site, and write the result to the appropriate location in the output array. (The output array should be separate from the input array to avoid race conditions as you update.) Notice the kernel has no loop in it! To compute the next iteration, swap the input and output pointers and run the kernel again.
There are many subtleties to making the cellular automata example run fast, mostly having to do with how the memory subsystem works, but you shouldn’t worry about that just yet. Try writing the simple version first, planning to throw it away when you are ready to make it fast. :)
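Here is a sketch of that simple version, assuming (purely for illustration) Conway's Game of Life as the update rule and periodic boundaries:

```
// One thread per lattice site, as described above. The update rule
// (Game of Life) and the periodic boundaries are assumptions made
// just for this example.
__global__ void step(const int *in, int *out, int width, int height)
{
    // Convert the block and thread indices to a lattice location.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Read the eight nearest neighbors (periodic boundaries).
    int alive = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            if (dx == 0 && dy == 0) continue;
            int nx = (x + dx + width) % width;
            int ny = (y + dy + height) % height;
            alive += in[ny * width + nx];
        }

    // Write the new state to the separate output array -- no loop here!
    int self = in[y * width + x];
    out[y * width + x] = (alive == 3) || (self && alive == 2);
}
```

On the host side, each iteration is then just one kernel launch followed by a pointer swap:

```
dim3 block(16, 16);
dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
for (int iter = 0; iter < nIters; ++iter) {
    step<<<grid, block>>>(d_in, d_out, width, height);
    int *tmp = d_in; d_in = d_out; d_out = tmp;  // swap input and output
}
```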
If you're familiar with MATLAB (and you have access to it), try the AccelerEyes Jacket product. You could probably get a good speedup from that without having to learn a whole bunch of new stuff (or get into the C programming needed for CUDA).
I would recommend looking at the UIUC (?) website for the draft book by David Kirk and Wen-mei Hwu. The first half of the book is available in the form of six PDFs. It details everything you need to know about writing a basic program in CUDA, including some simple but effective optimization tips. I found it very useful, and much less dense than the programming guide.
I think understanding parallel computation starts with understanding the basic operations a computer/chip/CPU/GPU performs.
There are basically three such operations:
1. Reading.
2. Writing.
3. Addressing (where to read/write).
The rest can be classified as:
4. Calculations/instruction execution.
To program parallel computers successfully, it's extra important to understand reading (1), writing (2), and addressing (3).
Operations 1 and 2 are important for understanding potential race conditions: one thread might be reading while another thread is writing to the same location, which can corrupt data, return stale values instead of new ones, or overwrite new values with old ones.
Operation 3 is important for understanding how to distribute the workload over multiple addresses, and this is where indexing comes into play. Index calculations based on the built-in variables such as blockIdx.x produce addresses that become distributed across memory and probably also across chips. Apparently the hardware understands these indices and/or addresses and knows how to distribute the work to each chip/core. Also, the kernel launch spawns multiple threads.
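A minimal sketch of the indexing pattern being described (the kernel name `scale` and its parameters are hypothetical, just to show the idiom):

```
// Each spawned thread computes a unique global index from the
// built-in variables; that index decides which address it reads
// and writes, which is how the workload gets distributed.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                     // guard the threads past the end
        data[i] = factor * data[i];
}
```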
So there is a 5th element at play:
5. Threading/program execution. Each thread has an instruction pointer plus additional registers that tell the GPU how to execute the thread (where it left off, i.e. where it was in the program/kernel when it got swapped out so another thread could execute) and the instruction it is supposed to execute.
So it also involves being able to think in terms of "parallel flows"… "parallel logic flows" and "parallel data flows".
These concepts will probably return in any higher-level language as well. The language might automate the distribution somewhat, though these problems might still creep up in higher-level languages here and there. At least higher-level languages can hide some of the nitty-gritty plumbing and could reduce "stupid C/C++" programming mistakes :)
A good introduction to the basics of computation would be a programmers' game called Core War, whose programs are written in a language called Redcode.
I happened to play and enjoy it, and it taught me a lot. It does require some time… I spent many months on it, perhaps even a year or two. I can understand if you don't have the time for that, but perhaps a quick look can't hurt :)