Optimize Reduction Process

Dear CUDA fellows,

I’m currently working on a non-uniform reduction, that is, compacting a vector down to only the elements that carry useful information. I’m using the scan provided in the CUDPP library and it works great: it gives me the memory addresses I need to run the scatter and end up with a reduced vector of useful information. The problem is that whenever a reduction looks like a solution, I immediately have to keep in mind that the process requires 3 major memory transfers:

1.- Every thread that has useful information needs to back it up to global memory (scatter) before running the scan.
2.- Once I have the addresses from the scan, I write all the useful information into a smaller vector (another scatter).
3.- Finally, I continue my processing, but the useful threads need to read from the reduced vector (gather).
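
To make the three transfers concrete, here is a minimal CPU-side sketch in C++ of the pattern I mean. It is only an illustration, not my actual CUDA code and not the CUDPP API: the function names (exclusive_scan_flags, scatter_useful, continue_processing) are made up for this post, the hand-written scan loop stands in for the library scan, and the "multiply by 2" is just a placeholder for the real processing. On the GPU, each loop iteration would be one thread.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for the library scan: exclusive prefix sum over 0/1 flags.
// addresses[i] is the slot element i would occupy in the reduced vector.
std::vector<std::size_t> exclusive_scan_flags(const std::vector<int>& flags) {
    std::vector<std::size_t> addresses(flags.size());
    std::size_t running = 0;
    for (std::size_t i = 0; i < flags.size(); ++i) {
        addresses[i] = running;
        running += (flags[i] != 0) ? 1 : 0;
    }
    return addresses;
}

// Step 2: scatter the backed-up values (step 1 wrote them into `backup`)
// into a smaller vector, using the addresses produced by the scan.
std::vector<float> scatter_useful(const std::vector<float>& backup,
                                  const std::vector<int>& flags,
                                  const std::vector<std::size_t>& addresses) {
    std::size_t count = 0;
    if (!flags.empty()) {
        count = addresses.back() + ((flags.back() != 0) ? 1 : 0);
    }
    std::vector<float> reduced(count);
    for (std::size_t i = 0; i < flags.size(); ++i) {
        if (flags[i] != 0) {
            reduced[addresses[i]] = backup[i];
        }
    }
    return reduced;
}

// Step 3: the remaining "threads" (one per reduced element) gather their
// value from the reduced vector and continue. The doubling is a placeholder.
std::vector<float> continue_processing(const std::vector<float>& reduced) {
    std::vector<float> out(reduced.size());
    for (std::size_t j = 0; j < reduced.size(); ++j) {
        out[j] = 2.0f * reduced[j];
    }
    return out;
}
```

In this sketch the `backup` vector is read once and the reduced vector is written once and read once, which corresponds to the three global-memory passes listed above.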

What I can see from this is that the idea of reduction sounds really good: when the algorithm rejects many threads, it compacts the work into only the useful ones and makes better use of the GPU resources. On the other hand, I have to pay for it with memory transfers. Have you faced this problem before and tried to optimize it? Did you find a better approach?

I hope you can understand my question; my English is not that good. Thanks for your time.