Ok, so I’m finally starting to get the hang of CUDA and of thinking massively parallel. I’m very pleased with my results so far, but I can’t help thinking about the parts of my programs that seem to force serial processing… there must be some way to make them parallel too. Probably there isn’t, but I thought I’d mention two common tasks in case someone has some ideas… I’ll even mention my own ideas too.
Common task 1. Summing the elements of a vector
Common task 2. Finding the largest (or smallest) element of a vector
So far the best I can come up with is to sum (or compare) consecutive pairs of elements of the vector in parallel, producing a new vector half the size of the original, then repeating the process. Each pass halves the vector, so the whole reduction takes about log2(n) passes instead of n−1 sequential operations, though the real speedup is smaller because of the extra memory traffic. Also, the way I’m doing this means that on each successive iteration half of the previously active threads are doing nothing, which seems wasteful, especially as there are more threads than GPU CUDA cores.
Alternatively, I could split the vector into two halves, sum each half sequentially in two parallel threads, and then add the two partial results. That caps the speedup at 2x, but at least it avoids the wasted threads.
Anyone got a better way to do this? Or comments on this approach? Or even other common coding tasks that seem to resist parallelization?