I’m having problems getting the reduction example in the SDK to run correctly in emulation mode. The trouble is in the code section that assumes the last 32 threads performing the reduction (a single warp) will do so in lockstep ("if (tid < 32)" …). The example in the SDK whitepaper leaves out some __syncthreads() calls because of this assumption.
What I’m seeing in emulation mode, though, is that the threads actually run sequentially, with thread 0 finishing all its additions before the other threads have even started. So the value that ends up in the first element of the array is the sum of only a few elements, not the sum of the entire array.
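To illustrate what I think is happening, here is a minimal host-side C++ sketch. It is not the SDK code: it only models the unrolled "if (tid < 32)" tail as I understand it, with six additions at strides 32, 16, 8, 4, 2 and 1. The names reduce_lockstep, reduce_sequential and all_ones, and the 64-element all-ones input, are my own assumptions for the example.

```cpp
#include <array>

// Hypothetical host-side model of the unrolled "if (tid < 32)" tail.
// Nothing here is taken from the SDK source; it only mimics the access pattern.
constexpr int kWarpSize = 32;
constexpr int kStrides[] = {32, 16, 8, 4, 2, 1};
using Shared = std::array<int, 64>;  // stands in for the shared-memory array

// Lockstep model: every thread finishes the addition for one stride
// before any thread moves on to the next stride.
int reduce_lockstep(Shared s) {
    for (int stride : kStrides) {
        const Shared prev = s;  // all reads see the values from before this step
        for (int tid = 0; tid < kWarpSize; ++tid)
            s[tid] = prev[tid] + prev[tid + stride];
    }
    return s[0];
}

// Sequential model: each thread runs all six additions to completion
// before the next thread starts, as I observe in emulation mode.
int reduce_sequential(Shared s) {
    for (int tid = 0; tid < kWarpSize; ++tid)
        for (int stride : kStrides)
            s[tid] += s[tid + stride];
    return s[0];
}

// Helper: a 64-element input where every element is 1.
Shared all_ones() {
    Shared s;
    s.fill(1);
    return s;
}
```

With 64 ones, reduce_lockstep(all_ones()) gives 64, while reduce_sequential(all_ones()) gives only 7 (thread 0's own element plus six partners that nobody has updated yet). That is the same kind of too-small sum I see in element 0.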
I haven’t had a chance to test this on an actual GPU, but I’m guessing this is only a problem with the emulator and not with the actual hardware?