# Any good ideas for this special "reduction"?

Hi,

This is my problem:
I have a very long array A of float values (e.g. 1 million or 10 million elements). All values are randomly distributed in the range [0…N], where N is about 1000-4000.
I have a second, much smaller array B with just N floats. Every value of the first array has to be “put into” the second array with linear weighting.

That means for example:
One element of the array A is “100.2”. So element “100” of the second array B is incremented by 0.8 and element “101” is incremented by 0.2.
Another element of the array A is “132.5”. So element “132” of the second array B is incremented by 0.5 and element “133” is incremented by 0.5.

This continues until all 10 million elements of array A are processed.
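For reference, the binning rule above can be written as a minimal serial sketch (plain C++; the function name and the extra slack entry in B are my own choices, not from the original post):

```cpp
#include <cmath>
#include <vector>

// Serial reference: scatter each value of A into B with linear weighting.
// A value v = i + f (i = floor(v), 0 <= f < 1) adds (1 - f) to B[i]
// and f to B[i + 1].
std::vector<float> weighted_bin(const std::vector<float>& A, int N) {
    // N + 2 entries so a value exactly equal to N cannot write past the end
    std::vector<float> B(N + 2, 0.0f);
    for (float v : A) {
        int i = static_cast<int>(std::floor(v));
        float f = v - static_cast<float>(i);
        B[i]     += 1.0f - f;   // weight toward the lower bin
        B[i + 1] += f;          // weight toward the upper bin
    }
    return B;
}
```

So for A = {100.2, 132.5}, B[100] gets 0.8, B[101] gets 0.2, and B[132] and B[133] get 0.5 each, matching the examples above.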

Does anyone have a good idea how to do this fast with CUDA?

Thanks for any help!

I suggest that each thread block deal with a chunk of array A, accumulate results into a private copy of B in shared memory, and finally use atomic operations to add that shared-memory copy to the global array B.

Since you want to minimize the number of atomic operations on global memory, each thread block should process as much data as possible.

For example, with 512 threads per block, a block can deal with 4K elements or more.
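A rough sketch of that scheme (names and launch parameters are my own; the built-in `atomicAdd(float*, float)` requires compute capability 2.0 or later, so on older cards a CAS-based emulation would be needed instead):

```cuda
// Per-block shared-memory accumulation, as described above.
// Assumes B and the shared copy have N + 1 entries so that the
// upper bin i + 1 of the largest integer part is still in range.
// Launch as: binKernel<<<blocks, threads, (N + 1) * sizeof(float)>>>(...)
__global__ void binKernel(const float* A, int lenA, float* B, int N)
{
    extern __shared__ float sB[];            // this block's private copy of B
    for (int i = threadIdx.x; i < N + 1; i += blockDim.x)
        sB[i] = 0.0f;
    __syncthreads();

    // Grid-stride loop: each block handles many elements of A.
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x;
         idx < lenA;
         idx += gridDim.x * blockDim.x)
    {
        float v = A[idx];
        int   i = (int)v;                    // integer part
        float f = v - (float)i;              // fractional part
        atomicAdd(&sB[i],     1.0f - f);     // shared-memory atomics only
        atomicAdd(&sB[i + 1], f);
    }
    __syncthreads();

    // One global atomic add per bin per block.
    for (int i = threadIdx.x; i < N + 1; i += blockDim.x)
        atomicAdd(&B[i], sB[i]);
}
```

The key point is that the vast majority of atomics hit fast shared memory; global memory sees only (N + 1) atomic adds per block.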

One strange idea just popped into my head. I don’t claim it is the best and fastest, but it is a start…

1. We start by sorting the A array (not that long after all, there are good parallel sorting algorithms)
2. For each element i of our array A we split the number into its integer part I_i and fractional part F_i.
3. Then we perform a conditional prefix sum. In a conditional prefix sum we add two values only if a certain condition is met. In our case, for numbers N_p and N_q:
if (I_p == I_q)
we add up F_p and F_q.

Numbers with the same integer part N all lie in a single contiguous block without holes (because we performed the sort); therefore, after the conditional prefix sum, the first element of the block will hold the sum of all fractional parts of the numbers with that integer part: sum(F(N)).

4. Now we know the following values:
#I(N): the number of values whose integer part equals N
sum(F(N)): the sum of the fractional parts of all values whose integer part equals N.

So finally your array B[N] = #I(N) - sum(F(N)) + sum(F(N-1)): each value with integer part N contributes (1 - F) to bin N, and each value with integer part N-1 carries its fractional part F into bin N.
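That closed form can be sanity-checked on the host with per-bin counts and fractional-part sums (a plain C++ sketch with names of my own; the sort/prefix-sum machinery is replaced by simple loops, since only the formula is being verified):

```cpp
#include <cmath>
#include <vector>

// Check of the closed form above: for each bin n,
//   B[n] = count(n) - sumF(n) + sumF(n - 1)
// where count(n) is the number of values with integer part n and
// sumF(n) is the sum of their fractional parts.
std::vector<float> bin_via_counts(const std::vector<float>& A, int N) {
    std::vector<float> count(N + 1, 0.0f), sumF(N + 1, 0.0f);
    for (float v : A) {
        int i = static_cast<int>(std::floor(v));
        count[i] += 1.0f;
        sumF[i]  += v - static_cast<float>(i);
    }
    std::vector<float> B(N + 1, 0.0f);
    for (int n = 0; n <= N; ++n)
        B[n] = count[n] - sumF[n] + (n > 0 ? sumF[n - 1] : 0.0f);
    return B;
}
```

For A = {100.2, 132.5} this reproduces the weights from the problem statement: 0.8/0.2 around bin 100 and 0.5/0.5 around bin 132.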

My first thought was sorting, similar to Cygnus's. But now that he has given that solution, let me talk about something else.

## IDEA 1

1. Assume you start with K thread blocks. Each block would update a private copy of the 1000-4000-entry array, so your array “B” needs to be replicated K times.
For 100 blocks, you would need 100 arrays the size of “B”.

2. Divide the 10 million numbers among the K blocks. Each block updates its private B array.

3. Then you need to reduce these private B arrays into one final array.
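The final reduce step (3) needs no atomics at all if one thread owns each output bin. A sketch, assuming the private copies are laid out back-to-back in one buffer (layout and names are my own):

```cuda
// Collapse numCopies private B arrays (stored back-to-back in Bpriv,
// each of length N) into the final array B. One thread per bin;
// no atomics needed because every thread owns a distinct output element.
__global__ void reduceCopies(const float* Bpriv, float* B,
                             int N, int numCopies)
{
    int bin = blockIdx.x * blockDim.x + threadIdx.x;
    if (bin >= N) return;

    float sum = 0.0f;
    for (int c = 0; c < numCopies; ++c)
        sum += Bpriv[c * N + bin];   // strided reads across the copies
    B[bin] = sum;
}
```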

## IDEA 1++

But then, it still requires atomic operations inside a block…

Extending the idea further, we could give each thread its own private array. On a Tesla C1060 you would need at least 30*192 = 5760 threads to keep latencies hidden. That would mean 5760 arrays of 3000 floats each (for the 1000-4000 range), which boils down to about 69 MB. So you would be fine as long as the range stays small.
At the end, you need to reduce all these 5760 arrays.

## IDEA 2

Assign each element of the range to one thread. Each thread scans all the array elements, increments a local accumulator (which can be a thread-local register), and finally writes it to the B array.
If your B array is very small, you could assign multiple threads per bin and do a final reduction.
Threads in a block can load a chunk of data into shared memory, and then all threads scan that chunk and increment their accumulators.
Memory bandwidth would be an issue because every thread needs to scan everything. This would work well on architectures like Fermi, I would guess (because of the large shared caches).
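A sketch of this gather-style kernel (my own names; it assumes a fixed block size of 256 and that B has room for bin i + 1, i.e. values stay comfortably below N):

```cuda
// IDEA 2: one thread per output bin. Each block stages a tile of A in
// shared memory; every thread scans the whole tile and accumulates its
// own bin's weight in a register, so no atomics are needed at all.
__global__ void gatherKernel(const float* A, int lenA, float* B, int N)
{
    __shared__ float tile[256];                  // assumes blockDim.x == 256
    int   myBin = blockIdx.x * blockDim.x + threadIdx.x;
    float acc   = 0.0f;

    for (int base = 0; base < lenA; base += blockDim.x) {
        int j = base + threadIdx.x;
        tile[threadIdx.x] = (j < lenA) ? A[j] : -2.0f;  // sentinel hits no bin
        __syncthreads();

        for (int k = 0; k < blockDim.x; ++k) {
            float v = tile[k];
            int   i = (int)v;
            if (i == myBin)     acc += 1.0f - (v - i);  // lower-bin weight
            if (i + 1 == myBin) acc += v - i;           // upper-bin weight
        }
        __syncthreads();
    }
    if (myBin < N) B[myBin] = acc;
}
```

The trade-off is exactly as described above: each tile of A is read from global memory once but scanned by every thread, so arithmetic and shared-memory traffic replace global atomics.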

Good Luck,

Hi all,

Doing all that on my CPU takes about 150 ms for 10 million elements in array A, so that is what I have to beat. Sorting alone takes time on the same scale, I guess, so that is not an option. I will try your recommendations!

Better ways are still welcome :)

Philipp.

Hi again,

I tried one thing with an atomic float add function that I found in this forum, which I use on shared memory.
Now my thread blocks each have their own array B in shared memory. Can anyone explain how I can sum those shared-memory arrays into the global-memory array B?

Philipp.

My result: from 140 ms down to 5.7 ms.

I do the following:

- Every thread block has a shared-memory array B.
- All elements handled by a block are binned into the shared-memory array via the atomic float add function.
- At the end of the kernel, the shared-memory array is added to the global array, again via atomic float add.
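I cannot know which forum function is being used here, but the emulation commonly posted for pre-Fermi cards (compute capability < 2.0, which includes the Tesla C1060) builds a float add on top of integer `atomicCAS`; a sketch:

```cuda
// Emulated atomic float add via integer compare-and-swap.
// Works on shared or global memory; shared-memory atomics require
// compute capability >= 1.2. On compute >= 2.0 the built-in
// atomicAdd(float*, float) can be used directly instead.
__device__ float atomicFloatAdd(float* address, float value)
{
    int* addrAsInt = (int*)address;
    int  old = *addrAsInt, assumed;
    do {
        assumed = old;   // retry until no other thread changed the value
        old = atomicCAS(addrAsInt, assumed,
                        __float_as_int(value + __int_as_float(assumed)));
    } while (assumed != old);
    return __int_as_float(old);   // previous value, as atomicAdd does
}
```

Under contention the loop retries, which is why reducing the number of atomic operations (as discussed above) matters so much for performance.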

I take the Tesla C1060 as an example; it has 30 SMs.

If you want to use all of its resources, at least 30 thread blocks must be used.

Suppose you use 30 thread blocks, each with 512 threads, and 10M data elements;

then each thread deals with about 651 data elements.

1. At the end, you must write shared memory to the global B array via atomic operations.

This means that N = 1000 values are written to array B 30 times, serially.

I think that if array A can be reused, you could instead copy the shared-memory arrays into array A and

merge them into array B with a bisection (tree) reduction.

2. Inside a thread block, I think you can use an additional integer array count[1:N]

together with float B_shared[1:N] (assuming N is small enough that both fit in shared memory),

and add each data value to the corresponding entry of B_shared, say

100.2 -> add to B_shared[100], count[100]++

100.3 -> add to B_shared[100], count[100]++

and finally correct the weights in B_shared. This reduces the number of atomic operations.

Those sound like great results! I would like to do something very similar. Would you be willing to post your kernel here? Also, which atomic float add method are you using? Please post that as well, if you would.

thanks

Idea 2 sounds great. If a texture is used for array A, then each segment of A can be broadcast to all SMs.

Array A is effectively loaded only once.

I would like to see idea 2 implemented and see what the performance is.

-gshi

Oh yes! Textures! That's a good point, gshi!