Fast summation

I want to sum all the values in a volume of size 128 x 128 x 128. How do I do that in the most efficient way in CUDA?

I’ve tried launching one thread for each (x,y) position, with each thread summing over its 128 z-positions, but it is really slow since each thread has to read from memory 128 times.

Any ideas how to speed this up?

Just “forget” that it is a volume for that kernel and run 128 x 128 x 128 threads in a standard reduction pattern.

I’m using global memory, so I do the indexing myself with (x + y * DATA_W + z * DATA_W * DATA_H) to pretend it is a volume. What do you mean by a standard reduction pattern?

There is a reduction sample in the CUDA SDK that efficiently adds all values in an array.
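
To give a rough idea of the pattern, here is a minimal sketch. It is not the SDK sample's code, just a simplified version of the same idea: treat the volume as a flat array of 128 x 128 x 128 floats, let each block sum a slice of it in shared memory, then reduce the per-block partial sums in a second launch. The names partialSums, sumVolume, BLOCK, and the choice of 128 blocks are my own assumptions for illustration, and error checking is left out.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

constexpr int BLOCK = 256;  // threads per block (assumed; must be a power of two)

// Each block sums a strided slice of the flat array and writes one partial sum.
__global__ void partialSums(const float* in, float* out, size_t n)
{
    __shared__ float cache[BLOCK];

    size_t i      = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;

    // Grid-stride loop: neighbouring threads read neighbouring addresses,
    // so the global memory reads are coalesced.
    float acc = 0.0f;
    for (; i < n; i += stride)
        acc += in[i];

    cache[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction in shared memory: halve the active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = cache[0];
}

// Hypothetical host helper: copies n floats to the device and returns their sum.
float sumVolume(const float* h_data, size_t n)
{
    const int blocks = 128;  // assumed grid size for the first pass
    float *d_in, *d_partial, *d_result;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMalloc(&d_result, sizeof(float));
    cudaMemcpy(d_in, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    // First pass: one partial sum per block.
    partialSums<<<blocks, BLOCK>>>(d_in, d_partial, n);
    // Second pass: a single block reduces the partial sums to one value.
    partialSums<<<1, BLOCK>>>(d_partial, d_result, blocks);

    float result = 0.0f;
    cudaMemcpy(&result, d_result, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_partial);
    cudaFree(d_result);
    return result;
}
```

You would call it as sumVolume(volume, 128 * 128 * 128) with volume pointing at the host copy of the data. If the volume already lives on the GPU, skip the first cudaMalloc/cudaMemcpy and pass the device pointer straight to the kernel. Note that the (x, y, z) indexing never appears: for a plain sum the order of the elements does not matter, which is why you can forget that it is a volume.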

Thank you