I’ve found a bug in the code published on http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html.
In Example 39-1 (Naive Scan Algorithm), row 16 should be
temp[pout*n+thid] = temp[pin*n+thid]+temp[pin*n+thid - offset];
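because otherwise it would use the partial sum from two iterations earlier: with double buffering, the output buffer still holds what was written to it two iterations ago. The same error appears in the corresponding PDF (Parallel Prefix Sum (Scan) with CUDA).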
In addition, I think it would be better to write row 14 as
pin = 1 - pin;
It’s clearer, and it also avoids an execution dependency on the previous line.
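
For reference, here is a minimal sketch of the kernel with both changes applied. It assumes the rest of the listing is as I read it in the chapter: a single block, one thread per element, and 2*n floats of dynamic shared memory used as a double buffer. The main function, the test input, and the CPU reference check are my own additions for illustration and are not from the chapter.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Naive exclusive scan for one block of n threads.
// Requires 2 * n * sizeof(float) bytes of dynamic shared memory.
__global__ void scan(float *g_odata, const float *g_idata, int n)
{
    extern __shared__ float temp[];  // two buffers of n floats each
    int thid = threadIdx.x;
    int pout = 0, pin = 1;

    // Exclusive scan: shift right by one and set the first element to 0.
    temp[pout * n + thid] = (thid > 0) ? g_idata[thid - 1] : 0.0f;
    __syncthreads();

    for (int offset = 1; offset < n; offset *= 2)
    {
        pout = 1 - pout;  // swap double buffer indexes
        pin  = 1 - pin;   // suggested change for row 14
        if (thid >= offset)
            // Suggested fix for row 16: both operands come from the input buffer.
            temp[pout * n + thid] = temp[pin * n + thid] + temp[pin * n + thid - offset];
        else
            temp[pout * n + thid] = temp[pin * n + thid];
        __syncthreads();
    }
    g_odata[thid] = temp[pout * n + thid];  // write output
}

int main()
{
    const int n = 8;  // illustrative size; must fit in one block
    float h_in[n] = {3, 1, 7, 0, 4, 1, 6, 3};
    float h_out[n];

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    scan<<<1, n, 2 * n * sizeof(float)>>>(d_out, d_in, n);

    cudaError_t err = cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Compare against a sequential exclusive scan computed on the CPU.
    bool ok = true;
    float running = 0.0f;
    for (int i = 0; i < n; ++i) {
        if (h_out[i] != running) ok = false;
        printf("%g ", h_out[i]);
        running += h_in[i];
    }
    printf("\n%s\n", ok ? "matches CPU reference" : "MISMATCH");

    cudaFree(d_in);
    cudaFree(d_out);
    return ok ? 0 : 1;
}
```

Writing pin = 1 - pin gives the same values as pin = 1 - pout here only because the two indexes start out as 0 and 1, so they remain complementary after every swap.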