After looking at the scan / reduction examples in the SDK, I think I understand how to apply these techniques to computing the (index,value) of the peak element of a 1D array.
Now I have to complicate the problem in two steps:
First, how do I modify this if, instead of just finding one peak, I need to report the n largest peaks?
Second (and this may take a bit of explaining): Suppose we have the output of the FFT of some signal. There are peaks of various magnitudes and widths at different frequency locations. If I apply a threshold to this, I can segment the signal into some number of subsections that are above/below the threshold. Then, for each above-threshold subsection that is “wide enough” (above the threshold for more than some number of frequency bins), I want to compute the local peak. In other words: I want to find all peaks in the signal that are above a certain threshold and wider than a certain other threshold.
All of this is easy to do in a serial fashion, but I’m having trouble making it parallelizable for a GPU. Advice greatly appreciated!
As the third column shows, there are 3 subsections of the signal that are above the threshold. Globally, index 7 is the highest peak, but since that subsection is only 2 wide, we have to throw it away.
Desired output of kernel:
peaks at (index=2, value=130) and (index=12, value=140)
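For reference, the serial version I have in mind looks roughly like this (all the names are mine, just to pin down what I mean by “wide enough”):

    #include <vector>

    // Serial reference: report the local peak of every above-threshold
    // run that spans more than minWidth bins. Hypothetical names.
    struct Peak { int index; float value; };

    std::vector<Peak> findPeaks(const float* mag, int n,
                                float threshold, int minWidth)
    {
        std::vector<Peak> peaks;
        int i = 0;
        while (i < n) {
            if (mag[i] <= threshold) { ++i; continue; }
            int start = i;   // beginning of an above-threshold run
            int best  = i;   // index of the run's local maximum
            while (i < n && mag[i] > threshold) {
                if (mag[i] > mag[best]) best = i;
                ++i;
            }
            if (i - start > minWidth)   // keep only "wide enough" runs
                peaks.push_back({best, mag[best]});
        }
        return peaks;
    }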
Hopefully that makes sense. Is this problem a good candidate for the GPU, or is it not parallelizable enough?
However, you also asked “how do I modify this if instead of just finding one peak, I need to report the n largest peaks?” - these are two different algorithms.
If you know your thresholds in advance, your problem is MUCH simpler - you do not need to compare all the elements against each other, just against the thresholds… this is definitely simpler both algorithmically (O(N) vs. at best O(N log N)) and “cudically”.
[I edited this - the tree algorithm that came to mind first is not applicable, since it could merge different peaks.]
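For the thresholding step itself, a minimal sketch of what I mean (untested, names are mine):

    // Build a 0/1 mask with one comparison per element - O(N), and no
    // element ever needs to be compared with another. Untested sketch.
    __global__ void thresholdMask(const float* mag, int* mask,
                                  int n, float threshold)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            mask[i] = (mag[i] > threshold) ? 1 : 0;
    }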
OK, this is just a rough idea that came to mind.
Is it worthwhile on a GPU? Since it is O(N), this will be bandwidth limited. However, if you evaluate your FFT on the GPU, keep working on the GPU, and transfer just the peaks to the host… that is really effective compression!
You are right – I was considering them as two different problems; I should have been clearer.
I like the idea of iterated erosion; it parallelizes nicely. However, does step 4 require a linear search (O(n))? At the end you are left with an array of mostly zeros with a very occasional one. How do you parallelize finding the surviving peaks?
Yes, it is O(N) as well - finding the peaks is a local operation.
To find them I would follow an approach similar to erosion, but with an operator that leaves 1- or 2-width peaks, to exploit CUDA parallelism. No clean operator comes to mind, so I will explain with “if”s…
Actually, these operations do not give rise to much divergence, but I believe finding the correct logical operator would be better.
Then you need to loop until all the peaks are 1-width - so you need a global reduction, or __all()/__any() if you are on a compute capability 1.2 architecture.
Then you need to collect them - you can use atomics to keep a global counter and let each thread save the positions of its 1s to global memory.
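A rough sketch of that collection step (untested; d_count is a hypothetical global counter that must be zeroed before launch):

    // Once erosion has converged, every surviving 1 marks a peak.
    // Each thread claims a unique output slot through an atomic
    // counter and writes its (index, value) pair. Untested sketch.
    __global__ void collectPeaks(const int* mask, const float* mag,
                                 int n, int* outIndex, float* outValue,
                                 int* d_count)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && mask[i]) {
            int slot = atomicAdd(d_count, 1);  // unique output slot
            outIndex[slot] = i;
            outValue[slot] = mag[i];
        }
    }

The final value of d_count then tells the host how many peaks to copy back.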
One note… these are just ideas I think are valid. I come from long experience with “traditional” parallel computing, but with CUDA I am learning as well! So please come back with feedback, even (especially!) if you or anyone finds a different, better approach.
[edited]
Strange machine, the brain… the “erosion operator that leaves 1-width peaks” should be
mask2[i] = !(mask[i-1] ^ mask[i+1]) & mask[i]
I haven’t tested it: give it a try - I will this evening.
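As a kernel, that operator might look like the sketch below (also untested; I assume the mask is 0 outside the array, and the d_changed flag stands in for the global reduction mentioned above):

    // One erosion pass: mask2[i] = !(mask[i-1] ^ mask[i+1]) & mask[i].
    // A 1 survives only when both neighbours are equal, so each pass
    // strips one bin from each end of a run; isolated 1s survive.
    // Caveat: even-width runs die out completely rather than
    // shrinking to width 1. Untested sketch, names are mine.
    __global__ void erodeStep(const int* mask, int* mask2, int n,
                              int* d_changed)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        int left  = (i > 0)     ? mask[i - 1] : 0;  // 0 outside the array
        int right = (i < n - 1) ? mask[i + 1] : 0;
        mask2[i] = !(left ^ right) & mask[i];
        if (mask2[i] != mask[i])
            *d_changed = 1;   // benign race: every writer stores 1
    }

The host would relaunch this, swapping mask and mask2, until d_changed stays 0.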
2048 values is not a huge number; it fits into shared memory or the L1 cache. Try to implement it in a way that the data does not have to be read repeatedly from global device memory, or even from L2.
So either just go through the data once and keep the few largest values you need (e.g., in statically named registers), or use shared memory, or use the L1 cache with a limited number of threads.
Global device memory accesses are much slower than your actual computations.
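As an illustration for the single-peak case, here is a minimal sketch that reads global memory exactly once, stages everything in shared memory, and reduces to an (index, value) maximum (assumes one block of 1024 threads and n <= 2048; all names are mine):

    #include <cfloat>   // FLT_MAX

    // Single-block (index, value) max reduction. Each value is read
    // from global memory exactly once. Assumes blockDim.x == 1024
    // (a power of two) and n <= 2048. Minimal sketch.
    __global__ void argmaxShared(const float* mag, int n,
                                 int* outIndex, float* outValue)
    {
        __shared__ float sVal[1024];
        __shared__ int   sIdx[1024];
        int t = threadIdx.x;

        // Each thread folds its (up to two) elements into one candidate.
        float v = -FLT_MAX; int idx = -1;
        for (int i = t; i < n; i += blockDim.x)
            if (mag[i] > v) { v = mag[i]; idx = i; }
        sVal[t] = v; sIdx[t] = idx;
        __syncthreads();

        // Standard tree reduction in shared memory.
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (t < s && sVal[t + s] > sVal[t]) {
                sVal[t] = sVal[t + s];
                sIdx[t] = sIdx[t + s];
            }
            __syncthreads();
        }
        if (t == 0) { *outIndex = sIdx[0]; *outValue = sVal[0]; }
    }

Extending this to keep the few largest values per thread in registers before the shared-memory step, as suggested above, follows the same structure.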