reduction_kernel.cu (Harris reduction, reduce6): missing __syncthreads() / volatile if reduce6 is used inside a bigger kernel?

On device 1 of a GTX 295 I am having problems adding together the answers produced by separate threads.
(If all threads return their answers to the host and I add them together on the PC, all is well.)
So in desperation I am re-reading the documentation.
Harris’ reduce6 code makes great play of minimising the number of times __syncthreads() is used.
But it seems to me there may be a hole: if it were used as part of a bigger kernel,
threads 32 and above could overwrite the shared array sdata before thread 0 had finished with it.
To prevent this, should I put another __syncthreads() after the code which writes to g_odata?
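
To make the concern concrete, here is a minimal sketch of the situation I have in mind. It is not the SDK code: the kernel name sum_two_arrays, the host helper sum_two_arrays_on_device and the BLOCK constant are all made up for illustration, and the last stage is deliberately simplified so that thread 0 adds up the final 64 partial sums on its own instead of using reduce6's unrolled warp code. The kernel reduces one array, then reuses sdata to reduce a second one. As far as I can see, without the barrier marked "extra", threads 32 and above would be free to start the second pass and overwrite sdata[32..63] while thread 0 is still reading them.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK 128  // threads per block: a power of two, at least 64 (assumed)

// Hypothetical kernel (not from the SDK): one block sums a[0..n), then
// reuses the same shared array to sum b[0..n).
__global__ void sum_two_arrays(const float* a, const float* b, int n, float* out)
{
    __shared__ float sdata[BLOCK];
    const unsigned int tid = threadIdx.x;

    for (int pass = 0; pass < 2; ++pass) {
        const float* in = (pass == 0) ? a : b;

        // Each thread accumulates a strided slice of the input.
        float x = 0.0f;
        for (int i = tid; i < n; i += BLOCK) x += in[i];
        sdata[tid] = x;
        __syncthreads();

        // Tree reduction with a barrier per step, down to 64 partial sums.
        for (unsigned int s = BLOCK / 2; s > 32; s >>= 1) {
            if (tid < s) sdata[tid] += sdata[tid + s];
            __syncthreads();
        }

        // Simplified last stage: thread 0 alone reads sdata[0..63].
        if (tid == 0) {
            float total = 0.0f;
            for (int i = 0; i < 64; ++i) total += sdata[i];
            out[pass] = total;
        }

        // Extra barrier: nobody may start the next pass (and overwrite
        // sdata) until thread 0 has finished reading it.
        __syncthreads();
    }
}

// Hypothetical host helper: fills a with 1.0f and b with 2.0f, runs the
// kernel in a single block and returns {sum(a), sum(b)}; empty on CUDA error.
std::vector<float> sum_two_arrays_on_device(int n)
{
    std::vector<float> ha(n, 1.0f), hb(n, 2.0f), result(2, 0.0f);
    float *da = 0, *db = 0, *dout = 0;
    const size_t bytes = n * sizeof(float);
    bool ok = cudaMalloc(&da, bytes) == cudaSuccess
           && cudaMalloc(&db, bytes) == cudaSuccess
           && cudaMalloc(&dout, 2 * sizeof(float)) == cudaSuccess
           && cudaMemcpy(da, ha.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(db, hb.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        sum_two_arrays<<<1, BLOCK>>>(da, db, n, dout);
        // The blocking copy back also waits for the kernel to finish.
        ok = cudaMemcpy(result.data(), dout, 2 * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(da); cudaFree(db); cudaFree(dout);
    if (!ok) result.clear();
    return result;
}
```

Calling sum_two_arrays_on_device(1000) from a host program should give the two totals for arrays of 1000 ones and 1000 twos. The point of the sketch is only the placement of the last __syncthreads(); whether the real reduce6 needs the same treatment is exactly my question.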

As a separate issue, the manual (B.2.4) talks about volatile, but reduction_kernel.cu does not use it.
Am I right in assuming it is not needed?
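
For reference, this is roughly how I would expect volatile to be applied if it were needed. It is a sketch under assumptions, not the SDK code: the names block_sum_volatile, block_sum_on_device and BLOCK are invented, and the final stage is a loop over the last warp rather than reduce6's unrolled statements. Shared memory is accessed through a volatile pointer so the compiler cannot keep sdata values in registers between steps. I have also put in __syncwarp() calls, which only exist in much newer toolkits (CUDA 9 and later) than the CUDA 2.3 I am running; on 2.3 they would have to be left out.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK 128  // threads per block: a power of two, at least 64 (assumed)

// Hypothetical kernel (not from the SDK): one block sums in[0..n) and the
// last warp finishes the reduction through a volatile pointer.
__global__ void block_sum_volatile(const float* in, int n, float* out)
{
    __shared__ float sdata[BLOCK];
    const unsigned int tid = threadIdx.x;

    // Each thread accumulates a strided slice of the input.
    float x = 0.0f;
    for (int i = tid; i < n; i += BLOCK) x += in[i];
    sdata[tid] = x;
    __syncthreads();

    // Tree reduction with a barrier per step, down to 64 partial sums.
    for (unsigned int s = BLOCK / 2; s > 32; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Last stage: only warp 0 takes part, with no __syncthreads().
    if (tid < 32) {
        volatile float* v = sdata;   // force real shared-memory reads/writes
        float sum = v[tid];
        for (unsigned int s = 32; s > 0; s >>= 1) {
            sum += v[tid + s];       // tid + s <= 63, inside sdata
            __syncwarp();            // all lanes have read before any writes
            v[tid] = sum;
            __syncwarp();            // writes visible before the next read
        }
        if (tid == 0) *out = sum;    // only lane 0's total is meaningful
    }
}

// Hypothetical host helper: sums n copies of 1.0f in a single block and
// returns the result, or -1.0f if any CUDA call fails.
float block_sum_on_device(int n)
{
    std::vector<float> hin(n, 1.0f);
    float *din = 0, *dout = 0, result = -1.0f;
    const size_t bytes = n * sizeof(float);
    bool ok = cudaMalloc(&din, bytes) == cudaSuccess
           && cudaMalloc(&dout, sizeof(float)) == cudaSuccess
           && cudaMemcpy(din, hin.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        block_sum_volatile<<<1, BLOCK>>>(din, n, dout);
        // The blocking copy back also waits for the kernel to finish.
        ok = cudaMemcpy(&result, dout, sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(din); cudaFree(dout);
    return ok ? result : -1.0f;
}
```

This does not settle whether reduction_kernel.cu as shipped is safe without volatile; it only shows where I think the qualifier would go.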

Many thanks
Bill
ps: CUDA 2.3, driver 190.42-r2, Linux st127 2.6.29-gentoo-r5

    Dr. W. B. Langdon, 
    Department of Computer Science, 
    King's College London,
    Strand, London, WC2R 2LS, UK
    http://www.dcs.kcl.ac.uk/staff/W.Langdon/

ps: I forgot to say my comment was inspired by http://forums.nvidia.com/lofiversion/index.php?t99422.html