Let me give my rough take on this problem. This may or may not be efficient as I have not even coded it. I am designing it inside this text box.
SUM_KERNEL(float *g_in, float *g_out)
{

Divide the input array into “N” distinct subsets. N is the number of blocks that
you are going to spawn.

Now, each block has to operate on roughly “TOTAL/N” amount of dataitems.
Let us call this number as “M”.

g_in + (M*blockIdx.x) == SA (Start Address) for any given block

local_sum = 0;
for(i=threadIdx.x; i<M; i+=blockDim.x)
{
local_sum + = SA[i];
}
Note that all the global memory fetches are “coalesced”.

At this stage, each thread of a BLOCK holds a partial sum for THAT block.

g_out[blockIdx.x*blockDim.x + threadIdx.x] = local_sum;
Now, this demands that “g_out” requires as many entries as there are “threads”
for your kernel.
}
Now, it is just a question of calling “SUM_KERNEL” repeatedly with shrinking appropriate block numbers.
Also note that there is no synchronisation required between threads of a block.
Also, the shared memory usage is very very less.
So, depending on your hardware invoke the kernel with the correct amount of block and grid dimensions.
For example, On my 8800 GTX, I would call this with “32 threads per block” and 128 blocks. (16 multiprocessors * 8 max_active_blocks = 128 blocks).
That would mean – regardless of the input array size – I can reduce it to a partial sum of 4096 (128*32)elements.
Now, for input array sizes of the order 1 million, this 4096 is a very puny number.
Now, its upto the programmer to decide on optimal block and grid sizes and squeeze out performance.