I have tested the code associated with the paper
“Parallel prefix sum (Scan) with CUDA”.
I have also written a similar version in which I
use the naive algorithm at block level. That is, I
use the exact same framework as the code associated
with the paper, but instead of the up-sweep/down-sweep
tree at block level, I use the naive algorithm there.
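For context, here is a quick operation count I did (my own sketch in Python; I am assuming the naive pattern is the Hillis–Steele scan and the tree pattern is the Blelloch up-sweep/down-sweep from the paper):

```python
def naive_scan_adds(n):
    """Hillis-Steele (naive) inclusive scan: log2(n) steps; at step with
    offset d, every element i >= d adds in the value d positions to its
    left, so the step costs n - d additions."""
    adds = 0
    d = 1
    while d < n:
        adds += n - d
        d *= 2
    return adds

def tree_scan_adds(n):
    """Blelloch work-efficient scan: the up-sweep performs n - 1 additions
    and the down-sweep performs another n - 1."""
    return 2 * (n - 1)

for n in (128, 256):
    print(n, naive_scan_adds(n), tree_scan_adds(n))
# For n = 256: naive does 1793 adds vs 510 for the tree pattern.
```

So per 256-element block the naive pattern does roughly 3.5x the additions, which is why I expected it to be much worse.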
I am not very sure what to think of the results.
For 16M elements and 256 threads per block,
I get 15.90 ms for the tree up-and-down pattern
and 13.38 ms for the naive approach. For
128 threads per block, I get 13.90 ms for the tree
pattern, while the naive version fails to compile.
The difference does not seem to be great, but
as far as I understand, the naive approach should
have been much worse. Does anyone know
why this might happen, or whether this is a
reasonable result?