This is my problem:
I have a very long array A of float values (e.g. 1 million or 10 million elements). All float values lie randomly in the range [0…N]. N can be about 1000-4000.
I have a second array B, which is much smaller, with just N floats. All values of the first array have to be “put into” the second array with linear weighting.
That means for example:
One element of array A is 100.2, so element 100 of the second array B is incremented by 0.8 and element 101 is incremented by 0.2.
Another element of array A is 132.5, so element 132 of the second array B is incremented by 0.5 and element 133 is incremented by 0.5.
…
…until all 10 million elements of array A are done.
Does anyone have a good idea how to manage this (fast) with CUDA?
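For reference, a minimal baseline sketch of this weighted binning as a single CUDA kernel, assuming a GPU with hardware float atomicAdd (compute capability 2.0+; older cards need the CAS-based workaround mentioned further down the thread), B zeroed beforehand, and B sized so that B[k+1] is always a valid index:

```cuda
__global__ void weighted_bin_naive(const float* A, float* B, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = A[i];
    int   k = (int)floorf(x);        // integer part: left bin
    float f = x - (float)k;          // fractional part: weight of the right bin
    atomicAdd(&B[k],     1.0f - f);  // left bin gets 1 - f
    atomicAdd(&B[k + 1], f);         // right bin gets f (assumes B[k+1] is valid)
}
```

With 10 million elements and only ~1000-4000 bins, contention on the atomics is the obvious bottleneck, which is what the ideas below try to avoid.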
One strange idea just popped into my head. I don’t claim it is the best or fastest, but it is a start…
We start by sorting the A array (it is not that long after all, and there are good parallel sorting algorithms).
For each element i of our array A we split the number into its integer part I_i and fractional part F_i.
Then we perform a conditional prefix sum. In a conditional prefix sum we add two values only if a certain condition is met. In our case, for numbers N_p and N_q:
if (I_p == I_q)
we add up F_p and F_q.
Numbers with the same integer part N all lie in a single contiguous block (because we performed the sort), therefore after the conditional prefix sum the first element of each block holds the sum of all fractional parts of the numbers with that integer part: sum(F(N)).
Now we know the following values:
#I(N) - the number of elements whose integer part equals N
sum(F(N)) - the sum of all fractional parts of the elements whose integer part equals N.
So finally (each element with integer part N contributes 1 - F to bin N, and each element with integer part N-1 contributes F to bin N):
your array B[N] = #I(N) - sum(F(N)) + sum(F(N-1))
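For what it’s worth, here is a sketch of that pipeline using Thrust, where reduce_by_key plays the role of the conditional prefix sum (it directly produces the per-block totals #I(N) and sum(F(N))). All function and variable names are mine, not anyone’s actual code:

```cuda
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/transform.h>
#include <thrust/reduce.h>
#include <thrust/iterator/constant_iterator.h>

// Split a float into its integer part (the key) and fractional part (the value).
struct int_part  { __host__ __device__ int   operator()(float x) const { return (int)floorf(x); } };
struct frac_part { __host__ __device__ float operator()(float x) const { return x - floorf(x); } };

void weighted_bin_sorted(thrust::device_vector<float>& d_A,
                         thrust::host_vector<float>&  h_B)
{
    // 1. Sort A so equal integer parts form contiguous blocks.
    thrust::sort(d_A.begin(), d_A.end());

    // 2. Split into integer keys I_i and fractional values F_i.
    thrust::device_vector<int>   keys(d_A.size());
    thrust::device_vector<float> fracs(d_A.size());
    thrust::transform(d_A.begin(), d_A.end(), keys.begin(),  int_part());
    thrust::transform(d_A.begin(), d_A.end(), fracs.begin(), frac_part());

    // 3. Segmented sums per integer part: sum(F(N)) and the count #I(N).
    thrust::device_vector<int>   ukeys(d_A.size());
    thrust::device_vector<float> sumF(d_A.size()), cnt(d_A.size());
    size_t nseg = thrust::reduce_by_key(keys.begin(), keys.end(), fracs.begin(),
                                        ukeys.begin(), sumF.begin()).first
                  - ukeys.begin();
    thrust::reduce_by_key(keys.begin(), keys.end(),
                          thrust::constant_iterator<float>(1.0f),
                          ukeys.begin(), cnt.begin());

    // 4. B[N] = #I(N) - sum(F(N)) + sum(F(N-1)), done per segment on the host.
    thrust::host_vector<int>   k(ukeys.begin(), ukeys.begin() + nseg);
    thrust::host_vector<float> s(sumF.begin(),  sumF.begin()  + nseg);
    thrust::host_vector<float> c(cnt.begin(),   cnt.begin()   + nseg);
    for (size_t i = 0; i < nseg; ++i) {
        h_B[k[i]] += c[i] - s[i];                               // the (1 - F) weights
        if (k[i] + 1 < (int)h_B.size()) h_B[k[i] + 1] += s[i];  // the F weights
    }
}
```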
My first thought was sorting too… similar to Cygnus… But now that he has given a solution, let me suggest something else…
IDEA 1
Assume you start with K blocks. Each block would update a private array covering the 1000-4000 range, so your array B needs to be replicated K times.
For 100 blocks, you would need 100 copies of B.
Divide the 10 million numbers among the K blocks. Each block updates its private copy of B.
Then you need to reduce these private B arrays into one single array (see the sketch below).
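A minimal sketch of that final reduction step, assuming the K private copies live contiguously in global memory; names and the launch configuration (one thread per bin, i.e. ceil(rangeN / blockDim.x) blocks) are illustrative:

```cuda
__global__ void reduce_private_copies(const float* privB, float* B,
                                      int rangeN, int numCopies)
{
    int bin = blockIdx.x * blockDim.x + threadIdx.x;
    if (bin >= rangeN) return;
    float sum = 0.0f;
    for (int c = 0; c < numCopies; ++c)          // sum this bin across all private copies
        sum += privB[(size_t)c * rangeN + bin];
    B[bin] = sum;
}
```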
IDEA 1++
But then, it still requires atomic operations inside a block…
Extending the idea further, we could give each thread its own private array. On a Tesla C1060 you would need 30*192 threads minimum to keep your latencies hidden. That would mean 5760 arrays of 3000 floats each (for the 1000-4000 range), which boils down to about 68 MB. So you would be fine as long as the range stays small.
At the end, you need to reduce all these 5760 arrays (the same reduction kernel above applies, with numCopies = total thread count; a binning sketch follows).
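A sketch of that per-thread privatization pass, assuming a zero-initialized scratch buffer of numThreads * rangeN floats and rangeN large enough that k+1 is always in range; all names are illustrative:

```cuda
__global__ void bin_private_per_thread(const float* A, float* privB,
                                       int n, int rangeN)
{
    int tid      = blockIdx.x * blockDim.x + threadIdx.x;
    int nthreads = gridDim.x * blockDim.x;
    float* mine  = privB + (size_t)tid * rangeN;  // this thread's private copy of B
    for (int i = tid; i < n; i += nthreads) {     // grid-stride loop over A
        float x = A[i];
        int   k = (int)floorf(x);
        float f = x - (float)k;
        mine[k]     += 1.0f - f;                  // no atomics needed: private copy
        mine[k + 1] += f;                         // assumes k+1 < rangeN
    }
}
```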
IDEA 2
Assign each element of the output range to one thread. Each thread scans all the array elements, increments a local counter (which can be a thread-local register), and finally writes it to the B array.
If your B array is very small, you could assign multiple threads per output element and do a final reduction.
Threads in a block can load a chunk of data into shared memory, and then all threads scan that data and increment their counters (see the sketch below).
Memory bandwidth would be an issue because every thread needs to scan everything… This would work well on architectures like Fermi, I would guess (because of the large, fat common caches).
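A sketch of IDEA 2 as I understand it: one thread per output bin, with each block staging tiles of A in shared memory so the scan reads mostly come from on-chip memory. All names are illustrative, and blockDim.x is assumed to be 256:

```cuda
__global__ void bin_per_output(const float* A, float* B, int n, int rangeN)
{
    __shared__ float tile[256];                        // assumes blockDim.x == 256
    int bin   = blockIdx.x * blockDim.x + threadIdx.x; // the bin this thread owns
    float acc = 0.0f;                                  // thread-local register counter

    for (int base = 0; base < n; base += blockDim.x) {
        int j = base + threadIdx.x;
        if (j < n) tile[threadIdx.x] = A[j];           // stage a tile of A
        __syncthreads();
        int limit = min((int)blockDim.x, n - base);
        for (int t = 0; t < limit; ++t) {              // every thread scans the whole tile
            float x = tile[t];
            int   k = (int)floorf(x);
            if (k     == bin) acc += 1.0f - (x - (float)k);
            if (k + 1 == bin) acc += x - (float)k;
        }
        __syncthreads();
    }
    if (bin < rangeN) B[bin] = acc;                    // one uncontended write per bin
}
```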
Thanks for your answers!!!
Doing all that on my CPU takes about 150 ms for 10 million elements in array A, so that’s what I have to beat. Sorting takes time on the same scale, I guess, so that is not an option. I will try your recommendations!
I tried one thing with an atomic float add function that I found in this forum, which I use on shared memory.
Now my thread blocks have their own array B in shared memory. Can anyone explain to me how I can sum those shared memory arrays into the global memory B array?
I do the following:
- Every thread block has a shared memory array B.
- Every thread handles about 1000 elements of array A.
- All elements handled by a block are accumulated into the shared memory array via the atomic float add function.
- At the end of my kernel, the shared memory array is added to the overall array in global memory, again via atomic float add.
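Not the poster’s actual kernel, but a sketch of that scheme as described, using the usual CAS-based float atomic add that circulates on this forum (shared-memory atomicCAS needs compute capability 1.2+). RANGE and the array names are assumptions; all values of A are assumed to lie in [0, RANGE):

```cuda
// Software float atomic add for GPUs without hardware float atomicAdd.
__device__ void atomicFloatAdd(float* addr, float val)
{
    int* iaddr = (int*)addr;
    int old = *iaddr, assumed;
    do {                                      // retry until no other thread interfered
        assumed = old;
        old = atomicCAS(iaddr, assumed,
                        __float_as_int(__int_as_float(assumed) + val));
    } while (assumed != old);
}

#define RANGE 1000   // assumed bin range N; RANGE+1 floats must fit in shared memory

__global__ void block_histogram(const float* A, float* B, int n)
{
    __shared__ float sB[RANGE + 1];           // block-private copy of B
    for (int k = threadIdx.x; k <= RANGE; k += blockDim.x)
        sB[k] = 0.0f;                         // clear the private copy
    __syncthreads();

    // Each thread bins its slice of A into the shared-memory copy.
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        float x = A[i];
        int   k = (int)floorf(x);
        float f = x - (float)k;
        atomicFloatAdd(&sB[k],     1.0f - f);
        atomicFloatAdd(&sB[k + 1], f);
    }
    __syncthreads();

    // Fold the block's private copy into the global B.
    for (int k = threadIdx.x; k <= RANGE; k += blockDim.x)
        if (sB[k] != 0.0f)                    // skip empty bins to save global atomics
            atomicFloatAdd(&B[k], sB[k]);
}
```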
Those sound like great results… I would like to do something very similar. Would you be willing to post your kernel here? Also, which atomic float add method are you using? Please post that as well, if you would.