Can someone help me with the Scan application?

I’m a newbie in CUDA, and I’m learning by working through the examples.

I’m now reading “Parallel Prefix Sum (Scan) with CUDA” by Mark Harris, which can be found in the SDK folder or on the application page in CUDA Zone. In the naive version of the implementation, the author suggests a double-buffer approach so that, for arrays larger than the warp size, the results of one warp are not overwritten by threads in another warp.

In Listing 1 on page 6 (CUDA code for the naive scan algorithm), a __syncthreads() call is placed at the end of every iteration of the for loop, which means all threads in the block are synchronized at every step. My question is: do we still need the double-buffer structure, since all threads are synchronized anyway and this seems to have nothing to do with warps? If so, how does the double buffer help?
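To make the question concrete, here is roughly how I understand the double-buffered naive scan. This is a sketch in my own words, not a verbatim copy of Listing 1: the names `naive_scan` and `scan_on_gpu` are mine, and I am assuming a single block with one thread per element (so n cannot exceed the maximum block size) and 2 * n floats of dynamic shared memory.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Exclusive naive scan within one block.
// Assumption: launched with exactly n threads and 2 * n floats of shared memory.
__global__ void naive_scan(float *g_odata, const float *g_idata, int n)
{
    extern __shared__ float temp[];  // two buffers of n floats each
    int thid = threadIdx.x;
    int pout = 0, pin = 1;

    // Exclusive scan: shift the input right by one and put 0 in front.
    temp[pout * n + thid] = (thid > 0) ? g_idata[thid - 1] : 0.0f;
    __syncthreads();

    for (int offset = 1; offset < n; offset *= 2) {
        pout = 1 - pout;  // swap the roles of the two buffers
        pin  = 1 - pout;
        // Each step reads only from buffer pin and writes only to buffer pout.
        if (thid >= offset)
            temp[pout * n + thid] = temp[pin * n + thid] + temp[pin * n + thid - offset];
        else
            temp[pout * n + thid] = temp[pin * n + thid];
        __syncthreads();  // the per-step barrier my question is about
    }
    g_odata[thid] = temp[pout * n + thid];
}

// Hypothetical host helper: copies the input to the device, runs one block, copies back.
// Error checking is omitted to keep the sketch short.
std::vector<float> scan_on_gpu(const std::vector<float> &in)
{
    int n = static_cast<int>(in.size());
    std::vector<float> out(n);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    naive_scan<<<1, n, 2 * n * sizeof(float)>>>(d_out, d_in, n);
    cudaMemcpy(out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

Calling `scan_on_gpu({3, 1, 7, 0, 4, 1, 6, 3})` should return the exclusive prefix sums 0, 3, 4, 11, 11, 15, 16, 22. The part I am asking about is the pin/pout swap: within one step, thread thid reads element thid - offset, which is the element another thread writes in that same step.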

Completely confused…