I want to sum all the values in a volume of size 128 x 128 x 128, how do I do that in the most efficient way in CUDA?
I’ve tried launching one thread per (x, y) position, where each thread sums over the 128 z-positions, but it’s really slow since every thread has to read global memory 128 times.
Any ideas how to speed this up?
Just “forget” that it is a volume for that kernel and run 128 × 128 × 128 threads in a standard reduction pattern.
I’m using global memory, so I do the indexing myself (x + y * DATA_W + z * DATA_W * DATA_H) to treat the flat array as a volume. What do you mean by standard reduction pattern?
There is a reduction sample in the CUDA SDK that efficiently adds all values in an array.
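For reference, a minimal sketch of the pattern that sample uses: each block loads a chunk into shared memory, does a tree reduction, and writes one partial sum; you then reduce the partial sums with a second launch (or on the host). Names, the fixed block size of 256, and the first-add-during-load detail are my choices here, not copied from the SDK code:

```cuda
// Sketch of a standard shared-memory sum reduction (block size assumed 256).
__global__ void reduceSum(const float* in, float* out, int n)
{
    __shared__ float sdata[256];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x * 2 + threadIdx.x;

    // Each thread loads up to two elements and adds them on the way in,
    // halving the number of blocks needed.
    float sum = 0.0f;
    if (i < n)              sum  = in[i];
    if (i + blockDim.x < n) sum += in[i + blockDim.x];
    sdata[tid] = sum;
    __syncthreads();

    // Tree reduction in shared memory: active threads halve each step.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum; reduce the partials
    // with another kernel launch or on the host to get the total.
    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}
```

For 128 × 128 × 128 floats this needs only a couple of launches, and every global-memory value is read exactly once per pass, which is why it is so much faster than 128 serial reads per thread.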