How to do Parallel Reduction of many unequally sized arrays in CUDA?

Hi there

Could anyone suggest the best approach to computing the mean and standard deviation of a large number of relatively small but differently sized arrays in CUDA?

The parallel reduction example in the SDK works on a single very large array whose size is conveniently a multiple of the number of threads per block, but my case is rather different:

Conceptually, I have a large number of objects, each of which contains two components, upper and lower, and each of these components has an x and a y coordinate array, i.e. upper.x, upper.y, lower.x, lower.y.

Each of these arrays is approximately 800 elements long, but the length varies between objects (not within an object), e.g.

Object1.lower.x = 1.1, 2.2, 3.3
Object1.lower.y = 4.4, 5.5, 6.6
Object1.upper.x = 7.7, 8.8, 9.9
Object1.upper.y = 1.1, 2.2, 3.3

Object2.lower.x = 1.0, 2.0, 3.0, 4.0, 5.0
Object2.lower.y = 6.0, 7.0, 8.0, 9.0, 10.0
Object2.upper.x = 11.0, 12.0, 13.0, 14.0, 15.0
Object2.upper.y = 16.0, 17.0, 18.0, 19.0, 20.0

Please note that the above is just my way of representing the arrays; my data is not stored in C structs or anything like that, and it can be organised in whatever way I need. The point is that for each array the mean, the standard deviation, and eventually a histogram need to be computed, and within one particular object, ratios and differences between arrays need to be computed.

So how should I go about sending this data to the GPU device and organising my thread-block hierarchy? One idea I had was to zero-pad all my arrays so they are of the same length and have a group of blocks working on each object, but it seems there are all sorts of problems with that method, if it would work at all.
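To make the padding idea concrete, here is a rough sketch of what I have in mind (untested; MAX_LEN and the one-block-per-object launch are just placeholders):

```cuda
// Hypothetical zero-padded layout: every array is stretched to MAX_LEN so
// that one block can be assigned per object. Launch with 256 threads/block
// and num_objects blocks.
#define MAX_LEN 1024   // >= length of the longest array, rest padded with 0

__global__ void padded_sums(const float* vals, float* sums, int num_objects)
{
    __shared__ float cache[256];
    int obj = blockIdx.x;   // one block per object
    int tid = threadIdx.x;
    if (obj >= num_objects) return;

    // strided accumulation over this object's padded row
    float sum = 0.0f;
    for (int i = tid; i < MAX_LEN; i += blockDim.x)
        sum += vals[obj * MAX_LEN + i];

    cache[tid] = sum;
    __syncthreads();

    // standard shared-memory tree reduction
    for (int s = blockDim.x / 2; s > 0; s /= 2) {
        if (tid < s) cache[tid] += cache[tid + s];
        __syncthreads();
    }

    if (tid == 0) sums[obj] = cache[0];
}
```

The zeros are harmless for a sum, but they would skew a mean or standard deviation unless I also track each array's true length, which is part of why I doubt this approach.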

Thanks in advance

What you want to do can be implemented with a “segmented scan” or “segmented reduction” operation. Here’s a good paper that describes these operations.

Segmented scan is the more common of the two operations and you can find implementations in Thrust (example here) and CUDPP.
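As a quick illustration of segmented scan in Thrust, here is a minimal sketch using inclusive_scan_by_key, where a key array tags each value with its segment (the key layout is my assumption; any scheme where equal consecutive keys mark a segment works):

```cuda
#include <thrust/device_vector.h>
#include <thrust/scan.h>

int main()
{
    // keys mark which segment each value belongs to; runs of equal
    // consecutive keys form one segment
    int   k[] = {0, 0, 0, 1, 1, 2, 2, 2, 2};
    float v[] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
    thrust::device_vector<int>   keys(k, k + 9);
    thrust::device_vector<float> vals(v, v + 9);
    thrust::device_vector<float> sums(9);

    // running sum restarts whenever the key changes:
    // {1,3,6,  4,9,  6,13,21,30}
    thrust::inclusive_scan_by_key(keys.begin(), keys.end(),
                                  vals.begin(), sums.begin());
    return 0;
}
```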

Segmented reduction is less common but more directly solves the problem you’re after. You can find one implementation in our Sparse Matrix-Vector Multiplication code for the COO matrix format.

Since you say that your objects have ~800 values each, you might also consider a different approach that resembles the CSR (vector) SpMV kernel. This approach is probably easier and offers good performance when the number of “nonzeros per row” (in your case, values per object) is greater than 50 or so.
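For flavor, here is a minimal sketch of that warp-per-object pattern, assuming all objects’ values are concatenated into one flat vals array with an offsets array marking where each object begins (both names are hypothetical):

```cuda
// One warp per object; launch with num_objects * 32 threads in total.
// offsets[i] is the start of object i in vals; offsets[num_objects] is
// the total length.
__global__ void object_means(const float* vals, const int* offsets,
                             float* means, int num_objects)
{
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane    = threadIdx.x % 32;
    if (warp_id >= num_objects) return;

    int begin = offsets[warp_id];
    int end   = offsets[warp_id + 1];

    // strided accumulation: lane i sums elements begin+i, begin+i+32, ...
    float sum = 0.0f;
    for (int i = begin + lane; i < end; i += 32)
        sum += vals[i];

    // warp-level tree reduction with shuffles (CUDA 9+)
    for (int offset = 16; offset > 0; offset /= 2)
        sum += __shfl_down_sync(0xffffffff, sum, offset);

    if (lane == 0)
        means[warp_id] = sum / (end - begin);
}
```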

You can read about both of these techniques in our Tech Report and our Supercomputing paper on SpMV. The sparse matrix formats (COO and CSR) I mentioned above are described in those papers.

All the codes I mentioned are open source, so you are free to rewrite/adapt them to match your needs.

When I have time I’ll write an example that shows how to compute the mean of each segment with one of the above operations.
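In the meantime, here is a rough sketch of the segmented-mean computation with Thrust’s reduce_by_key (the key-per-element layout is just one way you might tag your data):

```cuda
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/iterator/constant_iterator.h>

int main()
{
    // two segments: {1,2,3} for object 0 and {10,20} for object 1
    int   k[] = {0, 0, 0, 1, 1};
    float v[] = {1, 2, 3, 10, 20};
    thrust::device_vector<int>   keys(k, k + 5);
    thrust::device_vector<float> vals(v, v + 5);

    thrust::device_vector<int>   out_keys(2);
    thrust::device_vector<float> sums(2), counts(2), means(2);

    // segmented sums: {6, 30}
    thrust::reduce_by_key(keys.begin(), keys.end(), vals.begin(),
                          out_keys.begin(), sums.begin());

    // segment lengths, by summing a constant 1 per element: {3, 2}
    thrust::reduce_by_key(keys.begin(), keys.end(),
                          thrust::constant_iterator<float>(1.0f),
                          out_keys.begin(), counts.begin());

    // means: {2, 15}
    thrust::transform(sums.begin(), sums.end(), counts.begin(),
                      means.begin(), thrust::divides<float>());
    return 0;
}
```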

I would highly recommend storing the values in a “Structure of Arrays” format (i.e. all the lower.x values live in one array, all the lower.y values live in another array, etc.).
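For example, a possible SoA layout (names are hypothetical) might look like:

```cuda
// Structure-of-Arrays layout: each component lives in its own flat array,
// with all objects' values concatenated back to back
struct ObjectsSoA {
    float* lower_x;   // all lower.x values, object after object
    float* lower_y;
    float* upper_x;
    float* upper_y;
    int*   offsets;   // offsets[i] = start of object i; offsets[n] = total
    int    num_objects;
};
```

With this layout each component array can be reduced with the same segmented kernel, and coalesced memory access comes for free.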