The most important parallel algorithms to teach students

What do you consider the most important parallel algorithms a student needs to learn (regardless of language)?

A list for starters. Variations on these:

  • Reduction
  • Scan
  • Merge Sort

What else should be added and why?

Since GPU programming is relatively new, I think what most people want to see is side-by-side GPU and CPU implementations of popular algorithms.

Most published papers fall into one of three categories:

  1. They include no source code at all, and only discuss the implementation theoretically, in vague, high-level terms.

Example (this one is a real waste; no useful content at all beyond 'We implemented a dynamic programming algorithm and it was great!'):

https://docs.google.com/viewer?url=http://www.ijcsmr.org/vol2issue4/paper325.pdf

  2. They include pseudo-code for portions of the algorithm, but not enough to reconstruct the work in a real-world way.

Example:

https://docs.google.com/viewer?url=http://www.danielbit.com/~dali/papers/Graph_GPU_Adaptive_IPDPS13.pdf

  3. They do include the source, but it is spread out over multiple header/source files and abuses templates to the point where not even a ten-year veteran can make sense of it all.

Example:

https://code.google.com/p/back40computing/source/browse/#svn%2Ftrunk%2Fb40c%2Fgraph%2Fbfs

When I post CUDA code to GitHub, I always include an easy-to-follow CPU implementation alongside the GPU implementation.

Less than 20% of the commonly found papers/blogs/articles are informative enough to actually make use of their content.

The bottom line is that GPU programming is tricky, and the best way to learn is to see clear examples of implementations of popular algorithms.

Nvidia spends a lot of energy designing good GPUs, but it also needs to show new users how to use their features. Not everybody has the time to learn via trial and error.

Scans, reductions and merge sort have been discussed over and over.

How about clear examples of algorithms such as A*, optimized BFS, discrete 0/1 knapsack, bipartite matching, convex hull, Hungarian assignment, computing eigenvalues, etc.?

I will add convolution, which is everywhere in DSP, and matrix multiplication, because it is matrix multiplication.

I agree with CudaaduC: more traditional algorithms too (greedy, divide-and-conquer, etc.). In other words, write a book on analysis of algorithms using real parallel architectures with CUDA.

It is also important to realize that many people who use CUDA are going to be using Thrust, cuBLAS, and cuSPARSE to solve problems.

Thrust's vectors are slow to construct and copy, so there need to be more examples of using its device pointers (wrapping device memory allocated the traditional way) for reductions like max_element, scans, sorts, etc. When you use device pointers to device memory, thrust::sort is fast as hell.
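A minimal sketch of that pattern, assuming a CUDA toolkit with Thrust (the step that fills the buffer is elided): wrap memory from cudaMalloc in a thrust::device_ptr and pass it straight to the algorithms, with no device_vector construction or extra copies:

```cuda
#include <thrust/device_ptr.h>
#include <thrust/extrema.h>
#include <thrust/sort.h>
#include <cuda_runtime.h>

int main() {
    const int n = 1 << 20;
    int* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(int));
    // ... fill d_data with a kernel or cudaMemcpy ...

    // Wrap the raw device pointer; no device_vector, no host round trip.
    thrust::device_ptr<int> p(d_data);
    thrust::sort(p, p + n);  // sorts in place on the GPU
    thrust::device_ptr<int> max_it = thrust::max_element(p, p + n);
    (void)max_it;

    cudaFree(d_data);
    return 0;
}
```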

There should be clear examples of the tricky issues, like making the adjustment to column-major format (this trips up a lot of programmers), and the different sparse formats that are available and how they are stored (cuSPARSE).

Keep in mind that most of those linear algebra functions take 9-14 parameters, and if you get one wrong, horrible things can happen. Varied, clear examples always help.

Also include matrix multiplication examples where rows != columns, because too often square matrices are used, and that does not help distinguish the row-major to column-major adjustment.