Performing multiple summations in one GPU kernel

AstroGPU · August 17, 2013, 3:32pm

Hi all,

I’d like to accelerate the following task on a GPU. The problem is a reduction, but because of its structure, the “usual” trick of saving partial sums in shared memory won’t work.

This is the essence of the task:

void Task(double *x, double *Output) {
    for (int i = 0; i < 1000000; i++) {
        double AuxiliaryNumber = 0;  // reset for every i
        for (int k = 0; k < 600; k++) {
            // each step depends on the AuxiliaryNumber left by the previous k
            Output[k] += SomeFunction(x[i], AuxiliaryNumber);
            AuxiliaryNumber = OtherFunction(x[i], AuxiliaryNumber);
        }
    }
}

The output is an array of 600 numbers, each of which is a sum of one million terms. The loops can’t be interchanged, because the inner loop has to be executed sequentially. If instead of 600 I had a small number of outputs (like 5), I could define a 2D array in shared memory, 5 x ThreadsPerBlock, have each thread write its partial sums to it, and then finalize the summation. There is simply not enough shared memory (by a long shot) to do this with a size-600 array. Having each thread write its partial sums to global memory also seems a pretty poor option, since this structure would be hundreds of MB in size (depending on the number of threads, of course). It would fit, but I’m worried about performance.
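To put rough numbers on the memory estimate above, here is a minimal sketch in plain C. The helper name partial_sum_bytes and the thread counts are illustrative assumptions, not figures from the post.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed if every thread keeps its own copy of all partial sums.
   Hypothetical helper for back-of-the-envelope sizing, not a CUDA API. */
size_t partial_sum_bytes(size_t num_outputs, size_t num_threads) {
    assert(num_outputs > 0 && num_threads > 0);
    return num_outputs * num_threads * sizeof(double);
}
```

With 8-byte doubles and an assumed 256 threads per block, 5 outputs need 10,240 bytes per block, while 600 outputs need 1,228,800 bytes (about 1.2 MB). That is far more than the 48 KB of shared memory per block that was a common limit on GPUs of that period (an assumption here; check your device). For an assumed 65,536 threads in total, the global-memory version comes to 314,572,800 bytes (300 MiB), in line with the "hundreds of MB" estimate.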

Any ideas would be greatly appreciated!

njuffa · August 17, 2013, 4:25pm

Why does the inner loop have to be executed sequentially?

pasoleatis · August 17, 2013, 5:36pm

It appears you could have each thread handle one value of i, unless x[i] depends on something else. You can divide the inner loop into smaller parts and do some writes to global memory.

sBc-Random · August 18, 2013, 4:44am

Edit: misread problem.

That depends on how complex OtherFunction is. If it’s highly complex (and you are therefore compute bound), atomicAdd might be the way to go. If it’s very simple (or you can simplify it to something that’s easily repeatable), you should do what pasoleatis said and work on one value of i at a time.
For example:

OtherFunction(x) { return pow(x, 2); }

Then OtherFunction(x, threadIdx) { return pow(x, 2 * threadIdx); }

Now the loop is parallelised.

Unfortunately I assume it’s the first case (given that it troubled you enough to post here :P), in which case you will most likely have to resort to atomics.
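The pow example above can be made concrete under one possible reading. The sketch below assumes (this is not stated in the thread) that the auxiliary value starts at 1 and is multiplied by x^2 on every step, so that after k steps it equals x^(2k). The names aux_sequential, aux_closed_form and ipow are hypothetical.

```c
#include <assert.h>

/* Integer power by repeated squaring; avoids depending on libm. */
static double ipow(double base, unsigned exp) {
    double result = 1.0;
    while (exp > 0) {
        if (exp & 1u) result *= base;
        base *= base;
        exp >>= 1;
    }
    return result;
}

/* ASSUMED recurrence (not from the original post): aux starts at 1 and is
   multiplied by x^2 on every step, so aux after k steps is x^(2k). */
double aux_sequential(double x, unsigned k) {
    double aux = 1.0;
    for (unsigned step = 0; step < k; step++)
        aux *= x * x;
    return aux;
}

/* Closed form: the value for any k is computed without the earlier steps,
   so each k could be handled independently, e.g. by a separate thread. */
double aux_closed_form(double x, unsigned k) {
    assert(k <= 0x7fffffffu);  /* keep 2 * k from wrapping around */
    return ipow(x, 2u * k);
}
```

When such a closed form exists, the dependence between successive k values disappears and the inner loop no longer has to run sequentially. When it does not, the sequential version is the only option for a given i.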

sBc-Random · August 18, 2013, 5:53am

Also, if you have plenty of free global memory, you could always store the auxiliary values in global memory, one per value of i.

I.e.:

// AuxiliaryNumber: array of 10^6 doubles in global memory, one per i
void Task(double *x, double *Output, double *AuxiliaryNumber) {
    for (int i = 0; i < 1000000; i++)
        AuxiliaryNumber[i] = 0;
    for (int k = 0; k < 600; k++) {
        for (int i = 0; i < 1000000; i++) {
            Output[k] += SomeFunction(x[i], AuxiliaryNumber[i]);
            AuxiliaryNumber[i] = OtherFunction(x[i], AuxiliaryNumber[i]);
        }
    }
}
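A minimal sketch of why this interchange is valid: each i keeps its own auxiliary value, so the sequence of steps in k is preserved per i. The bodies of SomeFunction and OtherFunction below are hypothetical stand-ins, since the thread never specifies them, and task_i_outer and task_k_outer are illustrative names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the thread's unspecified functions. */
static double SomeFunction(double x, double aux)  { return x + aux; }
static double OtherFunction(double x, double aux) { return aux + x * 0.5; }

/* Original order: i outer, k inner, one scalar auxiliary value. */
void task_i_outer(const double *x, size_t n, double *out, size_t m) {
    for (size_t i = 0; i < n; i++) {
        double aux = 0.0;
        for (size_t k = 0; k < m; k++) {
            out[k] += SomeFunction(x[i], aux);
            aux = OtherFunction(x[i], aux);
        }
    }
}

/* Interchanged order: k outer, i inner, one auxiliary value per i. */
void task_k_outer(const double *x, size_t n, double *out, size_t m,
                  double *aux) {
    assert(aux != NULL);
    for (size_t i = 0; i < n; i++) aux[i] = 0.0;
    for (size_t k = 0; k < m; k++) {
        for (size_t i = 0; i < n; i++) {
            out[k] += SomeFunction(x[i], aux[i]);
            aux[i] = OtherFunction(x[i], aux[i]);
        }
    }
}
```

Both versions add the terms for a given k in the same order of i, so they produce the same sums. The price is the extra array: for 10^6 values of i, that is 10^6 doubles, or 8 MB.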

AstroGPU · August 19, 2013, 4:01am

Thanks! It’s probably best to implement something like sBc-Random’s last suggestion. The global memory requirement seems pretty small, and it will allow the use of shared memory for the reduction algorithm.
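As a rough illustration of that plan, the sketch below emulates it on the CPU in plain C; it is not a CUDA kernel and not the poster's actual implementation. The scratch array stands in for shared memory, the loop over t stands in for the threads of a block, and BLOCK, reduce_one_k, task_reduced and the two function bodies are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 8  /* emulated threads per block; must be a power of two */

/* Hypothetical stand-ins for the thread's unspecified functions. */
static double SomeFunction(double x, double aux)  { return x + aux; }
static double OtherFunction(double x, double aux) { (void)x; return aux + 1.0; }

/* Sum for one k: each "block" fills a BLOCK-sized scratch array (standing in
   for shared memory), tree-reduces it, and adds its partial sum to the total. */
double reduce_one_k(const double *x, double *aux, size_t n) {
    assert((BLOCK & (BLOCK - 1)) == 0);
    double total = 0.0;
    for (size_t base = 0; base < n; base += BLOCK) {
        double scratch[BLOCK];
        for (size_t t = 0; t < BLOCK; t++) {        /* one "thread" per i */
            size_t i = base + t;
            if (i < n) {
                scratch[t] = SomeFunction(x[i], aux[i]);
                aux[i] = OtherFunction(x[i], aux[i]);
            } else {
                scratch[t] = 0.0;                   /* pad the last block */
            }
        }
        for (size_t stride = BLOCK / 2; stride > 0; stride /= 2)
            for (size_t t = 0; t < stride; t++)
                scratch[t] += scratch[t + stride];  /* pairwise tree step */
        total += scratch[0];
    }
    return total;
}

/* Driver: one reduction pass per k, as in the interchanged loops. */
void task_reduced(const double *x, double *aux, size_t n,
                  double *out, size_t m) {
    for (size_t i = 0; i < n; i++) aux[i] = 0.0;
    for (size_t k = 0; k < m; k++)
        out[k] += reduce_one_k(x, aux, n);
}
```

In a real kernel the scratch array would be declared __shared__, the threads of a block would execute the t loop in parallel with __syncthreads() between tree steps, and the per-block partial sums would be combined in a second pass or with atomics. Note that the tree reduction adds terms in a different order than the sequential loop, so results can differ in the last bits for general floating-point data.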
