Performing multiple summations in one GPU kernel

Hi all,

I’d like to accelerate the following task on a GPU. The problem is one of reduction, however due to its nature, the “usual” trick of saving partial sums on shared memory won’t work.

This is the essence of the task:

void Task(double *x, double *Output) {
    for (int i = 0; i < 1000000; i++) {
        AuxiliaryNumber = 0;
        for (int k = 0; k < 600; k++) {
            Output[k] += SomeFunction(x[i], AuxiliaryNumber);
            AuxiliaryNumber = OtherFunction(x[i], AuxiliaryNumber);
        }
    }
}

The output is 600 numbers in an array, each of them is a sum of one million summands. The order of the loops can’t be switched, because the inner loop has to be executed sequentially. If instead of 600, I had a small number (like 5), then I could define a 2D structure in shared memory, 5xThreadsPerBlock, and have each thread write its partial sum to it, and then finalize the summation. There is simply not enough shared memory (by a long shot) to do this with a size-600 array. Having each thread write its partial sum to global memory also seems a pretty poor option, since this structure will be hundreds of MB in size (depending on the number of threads, of course)… It would fit, but I’m worried about performance.

Any ideas would be greatly appreciated!

Why does the inner loop have to be executed sequentially?

It appears you could each thread doing one value of i, unless x[i] depends on something else. You can divide the inner loop in smaller parts and do some writes to the global memory.

Edit: misread problem.

Depends how complex OtherFunction is. If it’s highly complex (and therefore you are compute bound), atomicadds might be the way to go, if it’s very simple (or you could simplify it to something that’s easily repeatable), you should do what pasoleatis said and work on one value of i at a time.
For example:

OtherFunction(x) { return pow(x,2) }

Then OtherFunction(x, threadIdx) {return pow(x,2*threadIdx)}

Now the loop is parallelised.

Unfortunately I assume it’s the first version (given that it troubled you enough to post here :P), in which case you will have to resort to atomics most likely.

Also, If you have a stack of free global memory, you could always store Auxiliary to global

Ie

void Task(double *x, double *Output, double *Auxiliary(10^6,1)) {
    AuxiliaryNumber = 0;
    for (int k = 0; k < 600; k++) {
    for (int i = 0; i < 1000000; i++) {
    Output[k] += SomeFunction(x[i], AuxiliaryNumber[i]);
    AuxiliaryNumber[i] = OtherFunction(x[i], AuxiliaryNumber[i]);
    }
    }
    }

Thanks! It’s probably best to implement something like sBc-Random’s last suggestion. It seems that the requirement from global memory is pretty small, and it will allow use of shared memory for the reduction algorithm.