Hi all,

I’d like to accelerate the following task on a GPU. The problem is a reduction, but because of its structure, the “usual” trick of accumulating partial sums in shared memory won’t work.

This is the essence of the task:

```
void Task(const double *x, double *Output) {
    // Output[0..599] is assumed to be zeroed before the call.
    for (int i = 0; i < 1000000; i++) {
        double AuxiliaryNumber = 0;
        for (int k = 0; k < 600; k++) {
            Output[k] += SomeFunction(x[i], AuxiliaryNumber);
            AuxiliaryNumber = OtherFunction(x[i], AuxiliaryNumber);
        }
    }
}
```

The output is an array of 600 numbers, each a sum of one million summands. The loops can’t be interchanged, because the inner loop must run sequentially: AuxiliaryNumber is carried from one k to the next. If the inner dimension were small (say 5 instead of 600), I could allocate a 2D array of size 5 × ThreadsPerBlock in shared memory, have each thread accumulate its partial sums in its own column, and then reduce across columns. But 600 doubles per thread is 4.8 KB, so a typical 48 KB of shared memory per block would hold the partial sums of only about ten threads — nowhere near enough. Having each thread write its partial sums to global memory instead also seems like a poor option: depending on the thread count, that buffer would be hundreds of MB in size… It would fit, but I’m worried about performance.

Any ideas would be greatly appreciated!