How to parallelize this 1D problem? Applying scan/reduction techniques

After looking at the scan / reduction examples in the SDK, I think I understand how to apply these techniques to computing the (index,value) of the peak element of a 1D array.

Now I have to complicate the problem in two steps:

First, how do I modify this if instead of just finding one peak, I need to report the n largest peaks?

Second (and this may take a bit of explaining): Suppose we have the output of the FFT of some signal. There are peaks of various magnitudes and widths at different frequency locations. If I apply a threshold to this, I can segment the signal into some number of subsections that are above/below the threshold. Then, for each above-threshold subsection that is “wide-enough” (above threshold for > some number of frequency bins), I want to compute the local peak. In other words: I want to find all peaks in the signal that are above a certain threshold and wider than a certain other threshold.

All of this is easy to do in a serial fashion, but I’m having trouble making it parallelizeable for a GPU. Advice greatly appreciated!

Yes, finding the largest element is just a reduction with the “max” operator.

To find the next largest value you could just do another max reduction, but ignoring the first max value.

If you want to find the n largest values for a large n, you might be better off just sorting them (using the radix sort in the SDK, for example).

A small example for the second step of the problem (I want to do this with BIG data sets):

threshold: 105

min peak width: 3

index   value   (threshold state)

0		100	 0

1		120	 1

2		130	 1

3		120	 1

4		90	  0

5		100	 0

6		140	 1

7		150	 1

8		100	 0

9		120	 1

10	   130	 1

11	   120	 1

12	   140	 1

13	   120	 1

As the third column shows, there are 3 subsections of the signal that are above the threshold. Globally, index 7 is the highest peak, but since that subsection is only 2 wide, we have to throw it away.

Desired output of kernel:

peaks at (index=2, value=130) and (index=12, value=140)

Hopefully that makes sense. Is this problem a good candidate for GPU? or not parallelizable enough?

However you asked too… “how do I modify this if instead of just finding one peak, I need to report the n largest peaks?” - these are 2 different algorithms.

If you know your thresholds in advance, your problem is MUCH simpler - since you do not need to compare all the element each-other but just to the thresholds… this is definitely simpler either algorithmically (O(N) vs at best O(N log N)) and “cudically”.

I would split your algo like this…

  1. signal thresholding: mask[i] = singal[i] > Vthreshold ? 1 : 0;

  2. 1D erosion: mask2[i] = mask[i-1] & mask[i] & mask[i+1]; //reiterate

  3. as many steps as Hthreshold/2

  4. find the survived “1s” - these are your peaks.

[I edited this - the tree algo coming as 1st idea is not applicable, since it could merge different peaks]

OK, this is just a rough idea come in.

Is it worthwile a GPU? Since it is O(N), this will be bandwidth limited. However if you eval your FFT on the GPU, continuing to work on the GPU an transferring just the peaks to the HOST… is really an effective compression!

I Hope it helps!

Hi sigismondo-

Thanks for your thoughts

You are right – I was considering them as two different problems, should have been clearer.

I like the idea of iterated erosion, this parallelizes nicely. However, does step 4 require a linear search (O(n))? At the end you are left with an array of mostly zeros with a very occasional one. How do you parallelize the finding of surviving peaks?

Hi,

yes, it is O(N) as well - finding the peaks is a local operation.

To find them I would follow a similar approach to erosion, but with an operator that leaves 1- or 2-width peaks - to exploit cuda parallelism. Nothing coming to my mind, so I explain with "if"s…

mask2[i] = !mask[i-1] & mask[i] & !mask[i+1] ? 1 : (mask[i-1] & mask[i] & mask[i+1])

You need to take care also of 2-width peaks… or they will be eroded away, so it becomes

mask2[i] = !mask[i-1] & mask[i] & mask[i+1] & !mask[i+2] ? 1 : “as before”

Actually these operations does not give origin to high divergence - but I believe finding the correct logic operator would be better.

Then you need to loop over until all the peaks are 1-width - so you need a global reduction, or __all/__any if you are on an 1.2 architecture.

Then you need to collect them - you can use atomics to keep a global counter and let each thread save its 1s positions to global memory.

One note… These are just ideas I think valid. I come from a long experience with “traditional” parallel computing, but with cuda, I am learning as well! So, please come back with feedback, even (expecially!) if you or any will find a different better approach.

[edited]

strange machine the brain… the “erosion operator that leaves 1-width peaks” should be

mask2[i] = !(mask[i-1] ^ mask[i+1]) & mask[i]

I havent tested it: give it a try - i will this evening

May b, you want to just find the average first and sort only the relevant elements. (helpful if N is large, no?)

Hi. I have 60 buffers with 2048 frequency values ​​(fft) How do I find the threshold? You said something about comparing each with each..,

2048 values is not a huge number, it fits into shared memory or L1 cache. Try to implement it in a way that data does not have to be read repeatedly from global device memory or even L2 memory.

So either just go through the data once and keep the few largest you need (e.g. in statically named registers), or use shared memory or use the L1 cache with a limited number of threads.

Global device memory accesses are much slower than your actual computations.

You want to find i largest values?