How to do Parallel Reduction of many unequally sized arrays in CUDA?

Hi there

Could anyone suggest the best approach to computing the mean and standard deviation of a large number of relatively small but differently sized arrays in CUDA?

The parallel reduction example in the SDK works on a single very large array whose size is conveniently a multiple of the number of threads per block, but my case is rather different:

Conceptually, I have a large number of objects, each of which contains two components, upper and lower, and each of these components has an x and a y coordinate array, i.e. upper.x, upper.y, lower.x, lower.y.

Each of these arrays is approximately 800 elements long, but the length varies between objects (not within an object), e.g.

Object1.lower.x = 1.1, 2.2, 3.3
Object1.lower.y = 4.4, 5.5, 6.6
Object1.upper.x = 7.7, 8.8, 9.9
Object1.upper.y = 1.1, 2.2, 3.3

Object2.lower.x = 1.0, 2.0, 3.0, 4.0, 5.0
Object2.lower.y = 6.0, 7.0, 8.0, 9.0, 10.0
Object2.upper.x = 11.0, 12.0, 13.0, 14.0, 15.0
Object2.upper.y = 16.0, 17.0, 18.0, 19.0, 20.0

Please note that the above is just my way of representing the arrays; my data is not stored in C structs or anything like that, and it can be organised in whatever way I need. The point is that for each array the mean, the standard deviation, and eventually a histogram need to be computed, and within one particular object, ratios and differences between arrays need to be computed.

So how should I go about sending this data to the GPU device and organising my thread-block hierarchy? One idea I had was to zero-pad all my arrays so they are of the same length and have a group of blocks working on each object, but it seems there are all sorts of problems with that method, if it would work at all.
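To make the padding idea concrete, here is a rough sketch of what I have in mind (untested; MAX_LEN and the one-block-per-object launch are just placeholders):

```cuda
// Hypothetical zero-padded layout: every array is stretched to MAX_LEN so
// that one block can be assigned per object. Launch with 256 threads/block
// and num_objects blocks.
#define MAX_LEN 1024   // >= length of the longest array, rest padded with 0

__global__ void padded_sums(const float* vals, float* sums, int num_objects)
{
    __shared__ float cache[256];
    int obj = blockIdx.x;   // one block per object
    int tid = threadIdx.x;
    if (obj >= num_objects) return;

    // strided accumulation over this object's padded row
    float sum = 0.0f;
    for (int i = tid; i < MAX_LEN; i += blockDim.x)
        sum += vals[obj * MAX_LEN + i];

    cache[tid] = sum;
    __syncthreads();

    // standard shared-memory tree reduction
    for (int s = blockDim.x / 2; s > 0; s /= 2) {
        if (tid < s) cache[tid] += cache[tid + s];
        __syncthreads();
    }

    if (tid == 0) sums[obj] = cache[0];
}
```

The zeros are harmless for a sum, but they would skew a mean or standard deviation unless I also track each array's true length, which is part of why I doubt this approach.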

Thanks in advance

What you want to do can be implemented with a “segmented scan” or “segmented reduction” operation. Here’s a good paper that describes these operations.

Segmented scan is the more common of the two operations and you can find implementations in Thrust (example here) and CUDPP.
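As a quick illustration of segmented scan in Thrust, here is a minimal sketch using inclusive_scan_by_key, where a key array tags each value with its segment (the key layout is my assumption; any scheme where equal consecutive keys mark a segment works):

```cuda
#include <thrust/device_vector.h>
#include <thrust/scan.h>

int main()
{
    // keys mark which segment each value belongs to; runs of equal
    // consecutive keys form one segment
    int   k[] = {0, 0, 0, 1, 1, 2, 2, 2, 2};
    float v[] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
    thrust::device_vector<int>   keys(k, k + 9);
    thrust::device_vector<float> vals(v, v + 9);
    thrust::device_vector<float> sums(9);

    // running sum restarts whenever the key changes:
    // {1,3,6,  4,9,  6,13,21,30}
    thrust::inclusive_scan_by_key(keys.begin(), keys.end(),
                                  vals.begin(), sums.begin());
    return 0;
}
```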

Segmented reduction is less common but more directly solves the problem you’re after. You can find one implementation in our Sparse Matrix-Vector Multiplication code for the COO matrix format.

Since you say that your objects have ~800 values each, you might also consider a different approach that resembles the CSR (vector) SpMV kernel. This approach is probably easier and offers good performance when the number of “nonzeros per row” (in your case, values per object) is greater than 50 or so.
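For flavor, here is a minimal sketch of that warp-per-object pattern, assuming all objects’ values are concatenated into one flat vals array with an offsets array marking where each object begins (both names are hypothetical):

```cuda
// One warp per object; launch with num_objects * 32 threads in total.
// offsets[i] is the start of object i in vals; offsets[num_objects] is
// the total length.
__global__ void object_means(const float* vals, const int* offsets,
                             float* means, int num_objects)
{
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane    = threadIdx.x % 32;
    if (warp_id >= num_objects) return;

    int begin = offsets[warp_id];
    int end   = offsets[warp_id + 1];

    // strided accumulation: lane i sums elements begin+i, begin+i+32, ...
    float sum = 0.0f;
    for (int i = begin + lane; i < end; i += 32)
        sum += vals[i];

    // warp-level tree reduction with shuffles (CUDA 9+)
    for (int offset = 16; offset > 0; offset /= 2)
        sum += __shfl_down_sync(0xffffffff, sum, offset);

    if (lane == 0)
        means[warp_id] = sum / (end - begin);
}
```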

You can read about both of these techniques in our Tech Report and our Supercomputing paper on SpMV. The sparse matrix formats (COO and CSR) I mentioned above are described in those papers.

All the codes I mentioned are open source, so you are free to rewrite/adapt them to match your needs.

When I have time I’ll write an example that shows how to compute the mean of each segment with one of the above operations.
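In the meantime, here is a rough sketch of the segmented-mean computation with Thrust’s reduce_by_key (the key-per-element layout is just one way you might tag your data):

```cuda
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/iterator/constant_iterator.h>

int main()
{
    // two segments: {1,2,3} for object 0 and {10,20} for object 1
    int   k[] = {0, 0, 0, 1, 1};
    float v[] = {1, 2, 3, 10, 20};
    thrust::device_vector<int>   keys(k, k + 5);
    thrust::device_vector<float> vals(v, v + 5);

    thrust::device_vector<int>   out_keys(2);
    thrust::device_vector<float> sums(2), counts(2), means(2);

    // segmented sums: {6, 30}
    thrust::reduce_by_key(keys.begin(), keys.end(), vals.begin(),
                          out_keys.begin(), sums.begin());

    // segment lengths, by summing a constant 1 per element: {3, 2}
    thrust::reduce_by_key(keys.begin(), keys.end(),
                          thrust::constant_iterator<float>(1.0f),
                          out_keys.begin(), counts.begin());

    // means: {2, 15}
    thrust::transform(sums.begin(), sums.end(), counts.begin(),
                      means.begin(), thrust::divides<float>());
    return 0;
}
```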

I would highly recommend storing the values in a “Structure of Arrays” format (i.e. all the lower.x values live in one array, all the lower.y values live in another array, etc.).
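For example, a possible SoA layout (names are hypothetical) might look like:

```cuda
// Structure-of-Arrays layout: each component lives in its own flat array,
// with all objects' values concatenated back to back
struct ObjectsSoA {
    float* lower_x;   // all lower.x values, object after object
    float* lower_y;
    float* upper_x;
    float* upper_y;
    int*   offsets;   // offsets[i] = start of object i; offsets[n] = total
    int    num_objects;
};
```

With this layout each component array can be reduced with the same segmented kernel, and coalesced memory access comes for free.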