This is my problem:
I have a very long array A of float values (e.g. 1 million or 10 million elements). All float values lie randomly in the range [0…N]. N can be about 1000-4000.
I have a second array B, which is much smaller, with just N floats. All values of the first array have to be “put into” the second array with linear weighting.
That means for example:
One element of array A is 100.2, so element 100 of the second array B is incremented by 0.8 and element 101 is incremented by 0.2.
Another element of array A is 132.5, so element 132 of the second array B is incremented by 0.5 and element 133 is incremented by 0.5.
…
…until all 10 million elements of array A are done.
Does anyone have a good idea how to manage this (fast) with CUDA?
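For reference, a minimal baseline sketch of this weighted binning as a single CUDA kernel, assuming a GPU with hardware float atomicAdd (compute capability 2.0+; older cards need the CAS-based workaround mentioned further down the thread), B zeroed beforehand, and B sized so that B[k+1] is always a valid index:

```cuda
__global__ void weighted_bin_naive(const float* A, float* B, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = A[i];
    int   k = (int)floorf(x);        // integer part: left bin
    float f = x - (float)k;          // fractional part: weight of the right bin
    atomicAdd(&B[k],     1.0f - f);  // left bin gets 1 - f
    atomicAdd(&B[k + 1], f);         // right bin gets f (assumes B[k+1] is valid)
}
```

With 10 million elements and only ~1000-4000 bins, contention on the atomics is the obvious bottleneck, which is what the ideas below try to avoid.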
One strange idea just popped into my head. I don’t claim it is the best or fastest, but it is a start…
We start by sorting the A array (it is not that long after all, and there are good parallel sorting algorithms).
For each element i of our array A we split the number into its integer part I_i and fractional part F_i.
Then we perform a conditional prefix sum. In a conditional prefix sum we add two values only if a certain condition is met. In our case, for numbers N_p and N_q:
if (I_p == I_q)
we add up F_p and F_q.
Numbers with the same integer part N all lie in a single contiguous block (because we performed the sort), therefore after the conditional prefix sum the first element of each block holds the sum of all fractional parts of the numbers with that integer part: sum(F(N)).
Now we know the following values:
#I(N) - the number of elements whose integer part equals N
sum(F(N)) - the sum of all fractional parts of the elements whose integer part equals N.
So finally (each element with integer part N contributes 1 - F to bin N, and each element with integer part N-1 contributes F to bin N):
your array B[N] = #I(N) - sum(F(N)) + sum(F(N-1))
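For what it’s worth, here is a sketch of that pipeline using Thrust, where reduce_by_key plays the role of the conditional prefix sum (it directly produces the per-block totals #I(N) and sum(F(N))). All function and variable names are mine, not anyone’s actual code:

```cuda
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/transform.h>
#include <thrust/reduce.h>
#include <thrust/iterator/constant_iterator.h>

// Split a float into its integer part (the key) and fractional part (the value).
struct int_part  { __host__ __device__ int   operator()(float x) const { return (int)floorf(x); } };
struct frac_part { __host__ __device__ float operator()(float x) const { return x - floorf(x); } };

void weighted_bin_sorted(thrust::device_vector<float>& d_A,
                         thrust::host_vector<float>&  h_B)
{
    // 1. Sort A so equal integer parts form contiguous blocks.
    thrust::sort(d_A.begin(), d_A.end());

    // 2. Split into integer keys I_i and fractional values F_i.
    thrust::device_vector<int>   keys(d_A.size());
    thrust::device_vector<float> fracs(d_A.size());
    thrust::transform(d_A.begin(), d_A.end(), keys.begin(),  int_part());
    thrust::transform(d_A.begin(), d_A.end(), fracs.begin(), frac_part());

    // 3. Segmented sums per integer part: sum(F(N)) and the count #I(N).
    thrust::device_vector<int>   ukeys(d_A.size());
    thrust::device_vector<float> sumF(d_A.size()), cnt(d_A.size());
    size_t nseg = thrust::reduce_by_key(keys.begin(), keys.end(), fracs.begin(),
                                        ukeys.begin(), sumF.begin()).first
                  - ukeys.begin();
    thrust::reduce_by_key(keys.begin(), keys.end(),
                          thrust::constant_iterator<float>(1.0f),
                          ukeys.begin(), cnt.begin());

    // 4. B[N] = #I(N) - sum(F(N)) + sum(F(N-1)), done per segment on the host.
    thrust::host_vector<int>   k(ukeys.begin(), ukeys.begin() + nseg);
    thrust::host_vector<float> s(sumF.begin(),  sumF.begin()  + nseg);
    thrust::host_vector<float> c(cnt.begin(),   cnt.begin()   + nseg);
    for (size_t i = 0; i < nseg; ++i) {
        h_B[k[i]] += c[i] - s[i];                               // the (1 - F) weights
        if (k[i] + 1 < (int)h_B.size()) h_B[k[i] + 1] += s[i];  // the F weights
    }
}
```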
My first thought was sorting too… similar to Cygnus… But now that he has given a solution, let me suggest something else…
IDEA 1
Assume you start with K blocks. Each block would update a private array covering the 1000-4000 range, so your array B needs to be replicated K times.
For 100 blocks, you would need 100 copies of B.
Divide the 10 million numbers among the K blocks. Each block updates its private copy of B.
Then you need to reduce these private B arrays into one single array (see the sketch below).
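A minimal sketch of that final reduction step, assuming the K private copies live contiguously in global memory; names and the launch configuration (one thread per bin, i.e. ceil(rangeN / blockDim.x) blocks) are illustrative:

```cuda
__global__ void reduce_private_copies(const float* privB, float* B,
                                      int rangeN, int numCopies)
{
    int bin = blockIdx.x * blockDim.x + threadIdx.x;
    if (bin >= rangeN) return;
    float sum = 0.0f;
    for (int c = 0; c < numCopies; ++c)          // sum this bin across all private copies
        sum += privB[(size_t)c * rangeN + bin];
    B[bin] = sum;
}
```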
IDEA 1++
But then, it still requires atomic operations inside a block…
Extending the idea further, we could give each thread its own private array. On a Tesla C1060 you would need 30*192 threads minimum to keep your latencies hidden. That would mean 5760 arrays of 3000 floats each (for the 1000-4000 range), which boils down to about 68 MB. So you would be fine as long as the range stays small.
At the end, you need to reduce all these 5760 arrays (the same reduction kernel above applies, with numCopies = total thread count; a binning sketch follows).
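A sketch of that per-thread privatization pass, assuming a zero-initialized scratch buffer of numThreads * rangeN floats and rangeN large enough that k+1 is always in range; all names are illustrative:

```cuda
__global__ void bin_private_per_thread(const float* A, float* privB,
                                       int n, int rangeN)
{
    int tid      = blockIdx.x * blockDim.x + threadIdx.x;
    int nthreads = gridDim.x * blockDim.x;
    float* mine  = privB + (size_t)tid * rangeN;  // this thread's private copy of B
    for (int i = tid; i < n; i += nthreads) {     // grid-stride loop over A
        float x = A[i];
        int   k = (int)floorf(x);
        float f = x - (float)k;
        mine[k]     += 1.0f - f;                  // no atomics needed: private copy
        mine[k + 1] += f;                         // assumes k+1 < rangeN
    }
}
```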
IDEA 2
Assign each element of the output range to one thread. Each thread scans all the array elements, increments a local counter (which can be a thread-local register), and finally writes it to the B array.
If your B array is very small, you could assign multiple threads per output element and do a final reduction.
Threads in a block can load a chunk of data into shared memory, and then all threads scan that data and increment their counters (see the sketch below).
Memory bandwidth would be an issue because every thread needs to scan everything… This would work well on architectures like Fermi, I would guess (because of the large, fat common caches).
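A sketch of IDEA 2 as I understand it: one thread per output bin, with each block staging tiles of A in shared memory so the scan reads mostly come from on-chip memory. All names are illustrative, and blockDim.x is assumed to be 256:

```cuda
__global__ void bin_per_output(const float* A, float* B, int n, int rangeN)
{
    __shared__ float tile[256];                        // assumes blockDim.x == 256
    int bin   = blockIdx.x * blockDim.x + threadIdx.x; // the bin this thread owns
    float acc = 0.0f;                                  // thread-local register counter

    for (int base = 0; base < n; base += blockDim.x) {
        int j = base + threadIdx.x;
        if (j < n) tile[threadIdx.x] = A[j];           // stage a tile of A
        __syncthreads();
        int limit = min((int)blockDim.x, n - base);
        for (int t = 0; t < limit; ++t) {              // every thread scans the whole tile
            float x = tile[t];
            int   k = (int)floorf(x);
            if (k     == bin) acc += 1.0f - (x - (float)k);
            if (k + 1 == bin) acc += x - (float)k;
        }
        __syncthreads();
    }
    if (bin < rangeN) B[bin] = acc;                    // one uncontended write per bin
}
```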
Thanks for your answers!!!
Doing all that on my CPU takes about 150 ms for 10 million elements in array A, so that’s what I have to beat. Sorting takes time on the same scale, I guess, so that is not an option. I will try your recommendations!
I tried one thing with an atomic float add function that I found in this forum, which I use on shared memory.
Now my thread blocks have their own array B in shared memory. Can anyone explain to me how I can sum those shared memory arrays into the global memory B array?
I do the following:
- Every thread block has a shared memory array B.
- Every thread handles about 1000 elements of array A.
- All elements handled by a block are accumulated into the shared memory array via the atomic float add function.
- At the end of my kernel, the shared memory array is added to the overall array in global memory, again via atomic float add.
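Not the poster’s actual kernel, but a sketch of that scheme as described, using the usual CAS-based float atomic add that circulates on this forum (shared-memory atomicCAS needs compute capability 1.2+). RANGE and the array names are assumptions; all values of A are assumed to lie in [0, RANGE):

```cuda
// Software float atomic add for GPUs without hardware float atomicAdd.
__device__ void atomicFloatAdd(float* addr, float val)
{
    int* iaddr = (int*)addr;
    int old = *iaddr, assumed;
    do {                                      // retry until no other thread interfered
        assumed = old;
        old = atomicCAS(iaddr, assumed,
                        __float_as_int(__int_as_float(assumed) + val));
    } while (assumed != old);
}

#define RANGE 1000   // assumed bin range N; RANGE+1 floats must fit in shared memory

__global__ void block_histogram(const float* A, float* B, int n)
{
    __shared__ float sB[RANGE + 1];           // block-private copy of B
    for (int k = threadIdx.x; k <= RANGE; k += blockDim.x)
        sB[k] = 0.0f;                         // clear the private copy
    __syncthreads();

    // Each thread bins its slice of A into the shared-memory copy.
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        float x = A[i];
        int   k = (int)floorf(x);
        float f = x - (float)k;
        atomicFloatAdd(&sB[k],     1.0f - f);
        atomicFloatAdd(&sB[k + 1], f);
    }
    __syncthreads();

    // Fold the block's private copy into the global B.
    for (int k = threadIdx.x; k <= RANGE; k += blockDim.x)
        if (sB[k] != 0.0f)                    // skip empty bins to save global atomics
            atomicFloatAdd(&B[k], sB[k]);
}
```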
Those sound like great results… I would like to do something very similar. Would you be willing to post your kernel here? Also, which atomic float add method are you using? Please post that as well, if you would.