Speed-ups for Reduction

I am running the parallel reduction example from the CUDA SDK with its default settings (kernel 6, 1048576 elements, 128 threads per block, 64 thread blocks) on an 8800 GT card. I get a GPU time of 1.474637 ms, a CPU time of 3.469666 ms, and a bandwidth of 2.8744634 GB/s. Isn't that poor bandwidth utilization? I tried varying the threads per block and the number of blocks, and I also increased the number of elements, but the bandwidth did not change significantly.

Are the low bandwidth and the low speed-up caused by host-to-device memory transfer delays, or is there some other reason?

That’s not right. I’m getting 40 GB/s bandwidth on an 8800 GTS (it takes about 0.1 ms).

Any idea why I am getting such low bandwidth?