CUDA parallel prefix

nepalezu · July 11, 2007, 9:09am

Hello.

I have tested the code associated with the paper
“Parallel prefix sum (Scan) with CUDA”.

I have also written a similar version that keeps
the exact framework of the code associated with
the paper, but replaces the tree up-and-down
pattern (up-sweep and down-sweep) at block level
with the naive algorithm.
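
For reference, here is a minimal CPU-side sketch in C++ of the two block-level patterns being compared. It is not the kernel code from the paper or from either version tested here. The names `ScanResult`, `naive_scan`, and `tree_scan` are made up for illustration, the input length is assumed to be a power of two, and each inner loop stands in for one parallel step executed by the threads of a block.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper type: the exclusive prefix sums plus the number of
// additions performed across all simulated threads.
struct ScanResult {
    std::vector<int> out;
    std::size_t adds;
};

// Naive pattern: log2(n) passes; in the pass with stride d, every element
// i >= d adds the element d positions to its left. The two vectors stand
// in for a double buffer in shared memory.
ScanResult naive_scan(const std::vector<int>& in) {
    const std::size_t n = in.size();
    std::vector<int> cur(n, 0), next(n, 0);
    std::size_t adds = 0;
    // Shift right by one so the final result is an exclusive scan.
    for (std::size_t i = 1; i < n; ++i) cur[i] = in[i - 1];
    for (std::size_t d = 1; d < n; d *= 2) {
        for (std::size_t i = 0; i < n; ++i) {
            if (i >= d) { next[i] = cur[i] + cur[i - d]; ++adds; }
            else        { next[i] = cur[i]; }
        }
        cur.swap(next);
    }
    return {cur, adds};
}

// Tree pattern: an up-sweep that builds partial sums in place, followed
// by a down-sweep that turns them into an exclusive scan.
// Assumes n is a power of two.
ScanResult tree_scan(const std::vector<int>& in) {
    const std::size_t n = in.size();
    std::vector<int> a(in);
    std::size_t adds = 0;
    // Up-sweep: after this loop a[n - 1] holds the total sum.
    for (std::size_t d = 1; d < n; d *= 2) {
        for (std::size_t i = 2 * d - 1; i < n; i += 2 * d) {
            a[i] += a[i - d];
            ++adds;
        }
    }
    // Down-sweep: clear the root, then swap-and-add down the tree.
    if (n > 0) a[n - 1] = 0;
    for (std::size_t d = n / 2; d >= 1; d /= 2) {
        for (std::size_t i = 2 * d - 1; i < n; i += 2 * d) {
            const int left = a[i - d];
            a[i - d] = a[i];
            a[i] += left;
            ++adds;
        }
    }
    return {a, adds};
}
```

Under these assumptions, a 256-element block costs 1793 additions in 8 passes with the naive pattern, against 510 additions in 16 passes (8 up, 8 down) with the tree pattern. The sketch only counts arithmetic; it says nothing about shared-memory bank conflicts, synchronization cost, or idle threads on real hardware.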

I am not sure what to make of the results.
For 16M elements and 256 threads per block,
I get 15.90 ms for the tree up-and-down pattern
and 13.38 ms for the naive approach. For
128 threads per block, I get 13.90 ms for the tree
pattern, and I cannot compile the naive one.

The difference is not large, but I expected
the naive approach to be much worse. Does
anyone know why this might happen, or whether
this is a reasonable result?

Thanks,
Nep.
