That sounds like the worst possible candidate for a GPU implementation and the definition of a serial algorithm.
Since each thread i needs the work of thread i-1 to be finished before it can start its work, i dont see any way to speed this up. When thread i is done with its work, itll just idle?
Looks to me like only on thread will be doing any actual work at any given time, in which case you might as well run it on the CPU.
The key here is that the binary operation between elements is associative (like addition). This lets you compute a bunch of partial sums in parallel and reuse them. The GPU can be very beneficial in this case (though more due to its enormous memory bandwidth).
The big issue here is that with only 10 elements, a prefix scan on the GPU is not worth it. Nearly the entire GPU will be idle, and the kernel launch overhead will be vastly greater than the actual runtime. If you needed to prefix scan a million elements (especially if they were already in the device memory), then the GPU would do a good job.
Sure i agree with that, but the way he presents the problem, i thought all elements from the array were populated from the first element of the array.
I agreee that it can be sped up if every element in the array already has a value, which is to say that value[n] can be computed at any time since value[n-1] is accessible at kernel launch.
The way i understood the problem (which might very well be completly wrong) is that only the first value in the array has an initial value and that every other element is cascaded from that first element.
-well not that i think about it for more than 3 secs, my understanding of the problem made no sense… so carry on and ignore my initial post!