Compute Cumulative Frequency


Can we compute cumulative frequency in CUDA.

Suppose if i have array[10] and i want to calculate the cumulative array where array[i]=array[i-1]+ some_constant.

In a single block how can i achieve this??

as all the threads will run in parallel so is it possible for a thread (i.e. 2) to wait for thread (1) to compute its value and use it.

is there any way this can be done faster than in CPU itself?

The sort of parallel prefix sum you want to perform can be done in CUDA. Have a look at the scan example and paper in the SDK.

That sounds like the worst possible candidate for a GPU implementation and the definition of a serial algorithm.
Since each thread i needs the work of thread i-1 to be finished before it can start its work, i dont see any way to speed this up. When thread i is done with its work, itll just idle?

Looks to me like only on thread will be doing any actual work at any given time, in which case you might as well run it on the CPU.

But who knows, i could be wrong!

The key here is that the binary operation between elements is associative (like addition). This lets you compute a bunch of partial sums in parallel and reuse them. The GPU can be very beneficial in this case (though more due to its enormous memory bandwidth).

The big issue here is that with only 10 elements, a prefix scan on the GPU is not worth it. Nearly the entire GPU will be idle, and the kernel launch overhead will be vastly greater than the actual runtime. If you needed to prefix scan a million elements (especially if they were already in the device memory), then the GPU would do a good job.

10 element was just some dummy number for example. But still my array wont be in millions and as per the paper that i studied regarding Parallel Prefix sum ( seems like my application wont benefit much from it or it will degrade.

Thank you very much for the information

Sure i agree with that, but the way he presents the problem, i thought all elements from the array were populated from the first element of the array.

I agreee that it can be sped up if every element in the array already has a value, which is to say that value[n] can be computed at any time since value[n-1] is accessible at kernel launch.

The way i understood the problem (which might very well be completly wrong) is that only the first value in the array has an initial value and that every other element is cascaded from that first element.

-well not that i think about it for more than 3 secs, my understanding of the problem made no sense… so carry on and ignore my initial post!