Getting started with parallel programming: Suggested reading

One question I’ve been asked a lot is how to get started with parallel programming. I asked around internally at NVIDIA and got some good suggestions, so I’m posting the responses here. I want to encourage people to comment and add their own suggestions. Does anybody have a favorite textbook they want to share? It doesn’t have to be specific to CUDA.

    From the HPC course at UNC, which also covers CUDA: http://www.cs.unc.edu/~prins/Classes/633/

    From Mark Harris: It’s not a textbook, but I always recommend these course notes on PRAM algorithms (the CRCW PRAM model maps very closely to CUDA, especially within a thread block using shared memory) by Sid Chatterjee & Jan Prins. They are concise and provide good examples for reductions, scan, Brent’s Theorem, etc. http://www.cs.unc.edu/~prins/Classes/633/Handouts/pram.pdf

    He also requires reading from Kumar et al. Introduction to Parallel Computing: Design and Analysis of Algorithms.

    Designing and Building Parallel Programs, I. Foster, Addison-Wesley, 1995. http://www-unix.mcs.anl.gov/dbpp/

    IBM’s redbook “RS/6000 SP: Practical MPI Programming” is very well known among MPI users. It has rich content on parallelization approaches, even though it was published 10 years ago. http://www.redbooks.ibm.com/abstracts/sg245380.html

    Parallel Programming in C with MPI and OpenMP by Michael J. Quinn (Author) is good for beginners.

    Parallel Programming with MPI by Peter Pacheco

    Parallel and Distributed Computation: Numerical Methods (Optimization and Neural Computation) by Dimitri P. Bertsekas

    Multiple people suggest: The Art of Multiprocessor Programming by Maurice Herlihy (Author), Nir Shavit (Author) http://www.amazon.com/Art-Multiprocessor-P…y/dp/0123705916

    Using OpenMP: Portable Shared Memory Parallel Programming (Scientific and Engineering Computation) by Barbara Chapman (Author), Gabriele Jost (Author), Ruud van der Pas (Author), David J. Kuck (Foreword)

    Using MPI - 2nd Edition: Portable Parallel Programming with the Message Passing Interface (Scientific and Engineering Computation) by William Gropp (Author), Ewing Lusk (Author), Anthony Skjellum (Author)

    From David Kirk: We use Tim Mattson’s book as a companion to the CUDA material.

    David Kirk & Wen-mei Hwu’s CUDA textbook is available on the course website. That version is one revision out of date, but pretty close. http://courses.ece.illinois.edu/ece498/al/

    From Paulius Micikevicius: My personal favorite book on parallel algorithms is “Introduction to Parallel Computing” by Grama et al. It covers basic interconnect topologies, algorithms, analysis, MPI and OpenMP. http://www.amazon.com/Introduction-Paralle…a/dp/0201648652

    If one is leaning slightly more towards the theoretical side of parallel algorithms, then “Introduction to Parallel Algorithms” by Joseph Jaja is a good source. It contains a more thorough treatment of algorithms based on prefix sums (things like various tree and graph algorithms).

    If one wants to go completely to the theoretical side (P-completeness, etc.), then “Limits to Parallel Computation: P-Completeness Theory” by Ray Greenlaw et al. is an excellent book. It’s certainly not applicable to introductory courses, in the same way that NP-completeness isn’t applicable to introductory algorithms courses.

This may be the same book referred to as “Tim Mattson’s book” above. The title is “Patterns for Parallel Programming” by Mattson, Sanders & Massingill. Addison-Wesley.

Thanks for the list. I do have a question, though: what does NVIDIA suggest doing with serial code that needs to be ported to the GPU? I have two such kernels that were ported to the GPU, one successfully; for the other I got only a ~4x speedup factor (which is not enough).

Such code would look like this:

for ( int iSample = 0; iSample < 1000; iSample++ )
{
   for ( int i = -val; i < val; i++ )
   {
      pRes[ iSample + i ] += someValue * i;   // (**)
   }
}

To make my life harder, the line marked with (**) might also look like this:

pRes[ ( rand() % 1000 ) + i ] += someValue * i;

This is real production code. The main reason the code is so “nice and user friendly” is that the algorithm tries to do some sort of averaging.

Hey - I didn’t write the algorithm… some mad scientist wrote it… ;)

thanks

eyal

This link is good for beginners… (I referred to it myself when I ventured into parallel programming)
https://computing.llnl.gov/tutorials/parallel_comp/

This list would be useful to sticky.

I agree!

for ( int iSample = 0; iSample < 1000; iSample++ )
{
   for ( int i = -val; i < val; i++ )
   {
      pRes[ iSample + i ] += someValue * i;   // (**)
   }
}

You can always find parallelism within the “2*val” interval (the inner loop writes to 2*val distinct elements)… and then slide that window across the samples…
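One way to make this loop race-free, sketched here in plain C++ rather than CUDA: turn the scatter into a gather, so that each output element sums its own contributions and no two workers ever write the same location. This is a different formulation from the sliding window suggested above, and not necessarily what was meant there. The function names, the shift by val into the buffer (so indices never go negative), and a someValue that stays constant across iterations are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <vector>

// Serial reference: the scatter loop from the question.
// Logical index (iSample + i) is shifted by +val so it is never negative.
std::vector<double> scatter_serial(int nSamples, int val, double someValue) {
    std::vector<double> res(nSamples + 2 * val, 0.0);
    for (int iSample = 0; iSample < nSamples; ++iSample)
        for (int i = -val; i < val; ++i)
            res[iSample + i + val] += someValue * i;
    return res;
}

// Gather form: one independent unit of work per output slot.
// On a GPU each slot could map to one thread; here it is an ordinary loop.
std::vector<double> gather_per_output(int nSamples, int val, double someValue) {
    std::vector<double> res(nSamples + 2 * val, 0.0);
    for (int slot = 0; slot < static_cast<int>(res.size()); ++slot) {
        int j = slot - val;  // logical index, equal to iSample + i
        // Keep only the i values for which iSample = j - i is a valid sample.
        int lo = std::max(-val, j - (nSamples - 1));
        int hi = std::min(val - 1, j);
        double acc = 0.0;
        for (int i = lo; i <= hi; ++i)
            acc += someValue * i;
        res[slot] = acc;  // single writer per slot, so no race
    }
    return res;
}
```

The gather version sums in a different order than the serial loop, so with floating-point data the two results can differ in the last bits; with values that are exactly representable they match exactly.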

pRes[ ( rand() % 1000 ) + i ] += someValue * i;

Do the same here… Any result that deviates because of a race condition can be explained as equivalent to the sequential algorithm run with a different random seed…
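One caveat worth keeping in mind: an unsynchronized += from several threads on the same element can lose updates, which changes the accumulated sum itself and not just which random indices were drawn. On the GPU the usual tool for that is atomicAdd. The sketch below shows another option in plain C++, privatization: the samples are split into chunks, each chunk accumulates into its own private buffer, and the buffers are merged afterwards. Everything named here is an assumption for illustration: pseudo_offset is a hypothetical hash standing in for rand() % 1000 so that each sample's index is reproducible in any execution order, and the integer accumulator type is chosen so the merge is exact.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical replacement for rand() % nSamples: a small hash of the sample
// index, so the offset does not depend on the order in which samples run.
inline int pseudo_offset(std::uint32_t iSample, std::uint32_t seed, int nSamples) {
    std::uint32_t x = iSample * 2654435761u + seed;
    x ^= x >> 16;
    x *= 2246822519u;
    x ^= x >> 13;
    return static_cast<int>(x % static_cast<std::uint32_t>(nSamples));
}

// Serial reference for the randomized variant of the (**) line.
std::vector<long long> random_scatter_serial(int nSamples, int val,
                                             long long someValue, std::uint32_t seed) {
    std::vector<long long> res(nSamples + 2 * val, 0LL);
    for (int s = 0; s < nSamples; ++s) {
        int base = pseudo_offset(static_cast<std::uint32_t>(s), seed, nSamples);
        for (int i = -val; i < val; ++i)
            res[base + i + val] += someValue * i;
    }
    return res;
}

// Privatized version: each chunk of samples writes only to its own buffer,
// so the chunks are independent; a final pass merges the buffers.
std::vector<long long> random_scatter_privatized(int nSamples, int val,
                                                 long long someValue, std::uint32_t seed,
                                                 int nChunks) {
    const std::size_t n = static_cast<std::size_t>(nSamples + 2 * val);
    std::vector<std::vector<long long>> priv(nChunks, std::vector<long long>(n, 0LL));
    for (int c = 0; c < nChunks; ++c) {  // chunks could run concurrently
        int begin = nSamples * c / nChunks;
        int end = nSamples * (c + 1) / nChunks;
        for (int s = begin; s < end; ++s) {
            int base = pseudo_offset(static_cast<std::uint32_t>(s), seed, nSamples);
            for (int i = -val; i < val; ++i)
                priv[c][base + i + val] += someValue * i;
        }
    }
    std::vector<long long> res(n, 0LL);
    for (int c = 0; c < nChunks; ++c)  // merge step
        for (std::size_t k = 0; k < n; ++k)
            res[k] += priv[c][k];
    return res;
}
```

The private buffers cost nChunks times the memory of the result, so this trade-off only makes sense when the output array is small relative to the amount of work, as it is with 1000 samples.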