CUDA SDK Reduction correctness question


In the reduction project of the CUDA SDK, three blocks of memory are allocated:

  1. host memory of size N (the values to reduce)
  2. device memory of size N (to hold the initial values copied from the host)
  3. device memory of size M (to hold the reduced values from the last pass)

Looking at benchmarkReduce(), when cpuFinalReduction is not used, it appears that every pass except the first uses memory block 3 as both input and output. Shouldn’t these reduction passes be double buffered (separate input and output memory blocks)? The reduction self-validates, so evidently double buffering isn’t needed, but I’m wondering whether it’s possible for thread block a (a > 1) to write its result before thread block 1 reads its input, thereby corrupting block 1’s input with newer values.

Can somebody explain why that doesn’t happen? Is it simply that thread blocks are scheduled onto the multiprocessors sequentially (i.e. block 1 always goes first)?
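
To make the concern concrete, here is a small CPU-only model of a single reduction pass. It is a sketch under my own assumptions, not the SDK code: reducePass, reduceTotal, and the chunk size are hypothetical names, and I assume each thread block sums one contiguous chunk of the input and writes its partial sum to the output slot given by its block index. Blocks are numbered from 0 here, so block 0 plays the role of block 1 above, and they run strictly one at a time in whatever order is passed in.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU model of one reduction pass (NOT the SDK kernel).
// Assumption: block b sums in[b*chunk .. (b+1)*chunk) and stores the
// partial sum in out[b]. Blocks run sequentially in the given order.
// `in` and `out` may refer to the same vector (the in-place case).
void reducePass(const std::vector<int>& in, std::vector<int>& out,
                std::size_t chunk, const std::vector<std::size_t>& order)
{
    for (std::size_t b : order) {
        assert(b < out.size());
        int sum = 0;
        for (std::size_t i = b * chunk; i < (b + 1) * chunk && i < in.size(); ++i)
            sum += in[i];
        out[b] = sum;  // in place, this may overwrite another block's input
    }
}

// Reduces eight ones with two blocks of four elements each and returns
// the sum of the two partial results. The correct answer is 8.
int reduceTotal(bool inPlace, const std::vector<std::size_t>& order)
{
    const std::size_t chunk = 4;
    std::vector<int> in(8, 1);
    std::vector<int> separate(2, 0);  // second buffer for the double-buffered case
    std::vector<int>& out = inPlace ? in : separate;
    reducePass(in, out, chunk, order);
    return out[0] + out[1];
}
```

In this model, reduceTotal(true, {0, 1}) gives 8, because block 0 has already consumed its chunk before block 1 writes to slot 1. With the order reversed, reduceTotal(true, {1, 0}) gives 11, because block 1 writes its partial sum into slot 1 while that slot is still unread input for block 0. With a separate output buffer, reduceTotal(false, {1, 0}) gives 8 regardless of order. This only models strictly sequential execution; it says nothing about how the GPU actually schedules blocks, which is exactly what I am asking about.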