Thinking parallel

OK, so I’m finally starting to get the hang of CUDA and thinking massively parallel. I’m very pleased with my results so far, but I can’t help thinking about the parts of my programs that seem to force serial processing… there must be some way to make them parallel too. Probably there isn’t, but I thought I’d mention two common tasks in case someone has some ideas… I’ll even mention my own ideas.

Common task 1. Summing the elements in a vector
Common task 2. Finding the largest (or smallest) element in a vector

So far the best I can come up with is to sum or compare consecutive pairs of elements in the vector (in parallel), creating a new vector half the size of the original, then repeating the process. This roughly halves the time these tasks take (only roughly, because of the increased memory traffic). Also, the way I’m doing this means that on each successive iteration half of the previously active threads are doing nothing, which seems wasteful, especially as there are more threads than GPU CUDA cores.

I would get a similar speed-up by just summing one half of the vector sequentially and the other half sequentially (but in parallel with each other) and then summing the two results, thus avoiding the wasted threads.
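Roughly what I’m doing now, confined to a single block for illustration (the kernel name is mine, and it assumes blockDim.x is a power of two and the whole vector fits in one block; the real version would combine per-block partial sums):

```cpp
// Pairwise (tree) reduction within a single block.
// Assumes blockDim.x is a power of two and n <= blockDim.x.
__global__ void blockSum(const float *in, float *out, int n)
{
    extern __shared__ float s[];
    int tid = threadIdx.x;

    s[tid] = (tid < n) ? in[tid] : 0.0f;
    __syncthreads();

    // Each step halves the number of active threads, as described above.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        out[0] = s[0];
}

// launch, e.g.: blockSum<<<1, 256, 256 * sizeof(float)>>>(d_in, d_out, n);
```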

Has anyone got a better way to do this? Or comments on it? Or even other common code tasks that seem to resist parallel coding?

Guy Blelloch’s book from 20 years ago (!) is a good place to start. It remains my favorite in-depth treatment of scan.

http://www.cs.cmu.edu/~blelloch/papers/Ble90.pdf

Just a few thoughts I would like to add:

All kinds of “cumulative operations” (stateful operations?) pose a threat to parallelization.

But “reductions” and “scans” can be used to remove this parallel hazard whenever the cumulation operator is ASSOCIATIVE.

i.e., while computing 1+2+3+…+1000, one could do 1+2, 3+4, 5+6, …, 999+1000: 500 parallel operations, leading to 250 parallel operations, then 125, and so on (as you have rightly pointed out). This is called “reduction” and can be applied whenever the cumulation operator is ASSOCIATIVE.

Similarly, non-Markov (i.e., stateful) operations pose a parallelization threat. Depending on what the algorithm remembers, it might or might not be parallelizable. Markov operations (like a random walk) can easily be parallelized, for example in Monte Carlo simulations.

Once you get the hang of the concepts, you should definitely check out the Thrust library:

http://code.google.com/p/thrust/

It implements most of these common algorithms in CUDA for you.
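For the two tasks in the original post, the Thrust calls look roughly like this (a minimal sketch with made-up data; in practice you would copy your own device buffer into a device_vector or wrap your device pointers):

```cpp
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/extrema.h>
#include <iostream>

int main()
{
    // Made-up data purely for illustration.
    thrust::device_vector<float> v(1000, 1.0f);

    // Common task 1: sum of all elements (a reduction).
    float sum = thrust::reduce(v.begin(), v.end(), 0.0f);

    // Common task 2: largest element (also a reduction, with max as the operator).
    float biggest = *thrust::max_element(v.begin(), v.end());

    std::cout << sum << " " << biggest << std::endl;
    return 0;
}
```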

Cool, thanks. Thrust looks quite useful.

It’s strange though: if I do matrix-vector multiplication in a single kernel of N threads (the matrix is N×N), including the sum, it takes the same time as running a kernel of N×N threads doing just the multiplication part. The idea then would be to do the summation part separately, either as mentioned earlier or using Thrust. I’m timing both in code and also with the CUDA profiler.
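Roughly what I mean by the N-thread version (a sketch; the name is mine and row-major layout is assumed), where each thread does the multiplies and the running sum for one row:

```cpp
// One thread per row: each thread multiplies and sums its whole row.
// A is N x N in row-major order, x and y are length-N vectors.
__global__ void matVecRowPerThread(const float *A, const float *x,
                                   float *y, int N)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N) return;

    float acc = 0.0f;
    for (int col = 0; col < N; ++col)
        acc += A[row * N + col] * x[col];   // multiply plus running sum in one pass
    y[row] = acc;
}
```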

You may also want to look at:
http://www.cse.chalmers.se/~billeter/papers.html
and their code (which is hard to read but is damn fast!)
http://www.cse.chalmers.se/~billeter/pub/pp/index.html

which includes compaction, but also prefix operations (sum all the elements, or make each element equal to the sum of all preceding elements) and sorting (radix sort).
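Those prefix operations map directly onto Thrust’s scans; a minimal sketch on toy data (the inclusive scan includes each element itself, the exclusive scan is the “sum of all preceding elements” version):

```cpp
#include <thrust/device_vector.h>
#include <thrust/scan.h>

int main()
{
    // Toy data to illustrate the two prefix operations mentioned above.
    thrust::device_vector<int> v(4);
    v[0] = 1; v[1] = 2; v[2] = 3; v[3] = 4;

    thrust::device_vector<int> incl(4), excl(4);

    // Inclusive scan: {1, 3, 6, 10} -- each element plus everything before it.
    thrust::inclusive_scan(v.begin(), v.end(), incl.begin());

    // Exclusive scan: {0, 1, 3, 6} -- the sum of strictly preceding elements.
    thrust::exclusive_scan(v.begin(), v.end(), excl.begin());

    return 0;
}
```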