reduction_kernel.cu (Harris reduction): is reduce6 missing a __syncthreads() / volatile if it is used inside a bigger kernel?

On device 1 of a GTX 295 I am having problems adding together the answers produced by separate threads.
(If all threads return their answers to the host and I add them together on the PC, all is well.)
So in desperation I am re-reading the documentation.
Harris' reduce6 code makes great play of minimising the number of times __syncthreads() is used.
But it seems to me there may be a hole if it were used as part of a bigger kernel:
threads 32 etc. could overwrite the shared array sdata before thread 0 has finished.
To prevent this, should I put another __syncthreads() after the code which writes to g_odata?
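
To make the worry concrete, here is a minimal sketch of the pattern, not the SDK's reduce6 itself. The kernel name sumChunks, the host helper sumChunksOnDevice and the block size of 128 are all made up for illustration. One block sums several chunks and reuses sdata for each. Thread 0 finishes each chunk on its own, standing in for reduce6's unsynchronised last stage, so other warps can run ahead of it unless a trailing barrier holds them back.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK 128  // hypothetical block size; power of two, at least 64

// Hypothetical "bigger kernel": one block sums several chunks of BLOCK floats,
// reusing the same shared array sdata for every chunk.
__global__ void sumChunks(const float *g_idata, float *g_odata, int nChunks)
{
    __shared__ float sdata[BLOCK];
    unsigned int tid = threadIdx.x;

    for (int c = 0; c < nChunks; ++c) {
        sdata[tid] = g_idata[c * BLOCK + tid];
        __syncthreads();

        // Barrier-synchronised tree steps, leaving 64 partial sums in sdata[0..63]
        for (unsigned int s = BLOCK / 2; s > 32; s >>= 1) {
            if (tid < s) sdata[tid] += sdata[tid + s];
            __syncthreads();
        }

        // Thread 0 finishes alone, reading slots that belong to threads 1..63
        if (tid == 0) {
            float sum = 0.0f;
            for (int i = 0; i < 64; ++i) sum += sdata[i];
            g_odata[c] = sum;
        }

        // The extra barrier in question: without it, threads 32..63 could start
        // the next chunk and overwrite sdata[32..63] while thread 0 still reads them.
        __syncthreads();
    }
}

// Hypothetical host helper: returns one sum per complete chunk of BLOCK inputs.
std::vector<float> sumChunksOnDevice(const std::vector<float> &in)
{
    int nChunks = static_cast<int>(in.size() / BLOCK);
    std::vector<float> out(nChunks, 0.0f);
    float *d_in = 0, *d_out = 0;
    cudaMalloc((void **)&d_in, in.size() * sizeof(float));
    cudaMalloc((void **)&d_out, nChunks * sizeof(float));
    cudaMemcpy(d_in, in.data(), in.size() * sizeof(float), cudaMemcpyHostToDevice);
    sumChunks<<<1, BLOCK>>>(d_in, d_out, nChunks);
    cudaMemcpy(out.data(), d_out, nChunks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

With an input of 256 ones, sumChunksOnDevice should return two sums of 128 each. If reduce6 is launched as a stand-alone kernel that ends right after the write to g_odata, nothing reuses sdata and the trailing barrier would seem to be unnecessary there.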

As a separate issue, the manual (B.2.4) talks about volatile, but reduction_kernel.cu does not use it.
Am I right in assuming it is not needed?
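
For comparison, a sketch of what the last stage might look like with volatile. This is an illustration under assumptions, not the code in reduction_kernel.cu: the names lastWarpSum, blockSum, blockSumOnDevice and the WARP_SYNC macro are invented here. The idea is that once __syncthreads() is dropped for threads 0..31, a volatile pointer stops the compiler from keeping shared-memory values in registers between steps. On toolkits that provide __syncwarp() (CUDA 9 and later) the macro also synchronises the warp explicitly; on older ones such as 2.3 it expands to nothing and the sketch assumes the warp runs in lock-step.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK 128  // hypothetical block size; power of two, at least 64

// __syncwarp() exists only from CUDA 9 onwards; on older toolkits the macro is
// empty and the code assumes the 32 threads of a warp execute in lock-step.
#if defined(CUDART_VERSION) && CUDART_VERSION >= 9000
#define WARP_SYNC() __syncwarp()
#else
#define WARP_SYNC()
#endif

// Last stage, run by threads 0..31 only, with no __syncthreads().
// The volatile pointer forces every step to re-read shared memory.
__device__ void lastWarpSum(volatile float *smem, unsigned int tid)
{
    for (unsigned int s = 32; s > 0; s >>= 1) {
        float v = smem[tid] + smem[tid + s];  // read both operands first
        WARP_SYNC();
        smem[tid] = v;                        // then publish the partial sum
        WARP_SYNC();
    }
}

// Each block sums its own BLOCK inputs into g_odata[blockIdx.x].
__global__ void blockSum(const float *g_idata, float *g_odata)
{
    __shared__ float sdata[BLOCK];
    unsigned int tid = threadIdx.x;

    sdata[tid] = g_idata[blockIdx.x * BLOCK + tid];
    __syncthreads();

    // Barrier-synchronised steps until 64 partial sums remain
    for (unsigned int s = BLOCK / 2; s > 32; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid < 32) lastWarpSum(sdata, tid);
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: returns one sum per complete block of BLOCK inputs.
std::vector<float> blockSumOnDevice(const std::vector<float> &in)
{
    int nBlocks = static_cast<int>(in.size() / BLOCK);
    std::vector<float> out(nBlocks, 0.0f);
    if (nBlocks == 0) return out;  // nothing to launch
    float *d_in = 0, *d_out = 0;
    cudaMalloc((void **)&d_in, in.size() * sizeof(float));
    cudaMalloc((void **)&d_out, nBlocks * sizeof(float));
    cudaMemcpy(d_in, in.data(), in.size() * sizeof(float), cudaMemcpyHostToDevice);
    blockSum<<<nBlocks, BLOCK>>>(d_in, d_out);
    cudaMemcpy(out.data(), d_out, nBlocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

Whether the non-volatile version happens to work would depend on what the compiler chooses to keep in registers, so getting the right answer with one compiler release would not by itself show that volatile is unnecessary.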

Many thanks
Bill
ps: CUDA 2.3 Driver 190.42-r2 Linux st127 2.6.29-gentoo-r5

    Dr. W. B. Langdon, 
    Department of Computer Science, 
    King's College London,
    Strand, London, WC2R 2LS, UK
    [url="http://www.dcs.kcl.ac.uk/staff/W.Langdon/"]http://www.dcs.kcl.ac.uk/staff/W.Langdon/[/url]


ps: I forgot to say my comment was inspired by a post on The Official NVIDIA Forums.