After looking at the scan/reduction examples in the SDK, I think I understand how to apply these techniques to computing the (index, value) pair of the peak element of a 1D array.

Now I have to complicate the problem in two steps:

First, how do I modify this if instead of just finding one peak, I need to report the n largest peaks?

Second (and this may take a bit of explaining): Suppose we have the output of the FFT of some signal. There are peaks of various magnitudes and widths at different frequency locations. If I apply a threshold to this, I can segment the spectrum into runs of consecutive bins that are either above or below the threshold. Then, for each above-threshold run that is “wide enough” (above threshold for more than some number of frequency bins), I want to compute the local peak. In other words: I want to find all peaks in the signal that are above a magnitude threshold and wider than a separate width threshold.
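To make the second case concrete, here is a minimal serial sketch of the behavior I am after. It assumes the FFT magnitudes are already in a host array; the names `Peak`, `find_wide_peaks`, `threshold`, and `min_width` are just placeholders for illustration, not anything from the SDK.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical result type: bin index and magnitude of one local peak.
struct Peak {
    std::size_t index;
    float value;
};

// Serial reference: for every run of consecutive bins with mag > threshold
// that spans more than min_width bins, report the largest bin in that run.
std::vector<Peak> find_wide_peaks(const std::vector<float>& mag,
                                  float threshold,
                                  std::size_t min_width) {
    std::vector<Peak> peaks;
    const std::size_t n = mag.size();
    std::size_t i = 0;
    while (i < n) {
        if (mag[i] <= threshold) {  // below threshold: not part of any run
            ++i;
            continue;
        }
        // i is the first bin of an above-threshold run.
        const std::size_t start = i;
        std::size_t best = i;
        while (i < n && mag[i] > threshold) {
            if (mag[i] > mag[best]) best = i;  // track the local maximum
            ++i;
        }
        // Keep the run only if it is wider than min_width bins.
        if (i - start > min_width) peaks.push_back({best, mag[best]});
    }
    return peaks;
}
```

Each above-threshold run is handled independently of the others here, which is the property I would hope to exploit on the GPU; the part I cannot see how to parallelize is finding the run boundaries and doing the per-run maximum without this sequential walk.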

All of this is easy to do serially, but I’m having trouble making it parallelizable for a GPU. Advice greatly appreciated!