harris/reduction reduce6 missing __syncthreads() / volatile? if reduce6 is used

On device 1 of a GTX 295 I am having problems adding together answers produced by separate threads.
(If all threads return their answers to the host and I add them together on the PC, all is well.)
So in desperation I am re-reading the documentation.
Harris’ reduce6 code makes great play of minimising the number of times __syncthreads() is used.
But it seems to me there may be a hole if reduce6 is used as part of a bigger kernel:
threads 32 and above could overwrite the shared array sdata before thread 0 has finished with it.
To prevent this, should I put another __syncthreads() after the code which writes to g_odata?
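
To make the scenario concrete, here is a minimal sketch of the kind of bigger kernel I have in mind. It is not Harris' reduce6 code: blockSum, twoPass, runTwoPass and BLOCK are names made up for illustration, the block size of 128 is an assumption, and the final warp steps use __syncwarp(), which assumes a current toolkit (CUDA 9 or later) rather than the CUDA 2.3 I am running. The layout follows the same idea as reduce6: barriers for the wide steps, then only warp 0 finishes the sum while the other threads run ahead.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 128  // hypothetical block size: a power of two, at least 64

// Block-wide sum laid out like reduce6: barriers for the wide steps, then
// only warp 0 runs the final steps. The total ends up in sdata[0].
__device__ void blockSum(float *sdata, unsigned int tid)
{
    for (unsigned int s = BLOCK / 2; s > 32; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid < 32) {
        // Threads 32 and above skip this branch and run ahead of warp 0.
        for (unsigned int s = 32; s > 0; s >>= 1) {
            float v = sdata[tid + s];   // read the partner element
            __syncwarp();               // all reads done before any write
            sdata[tid] += v;
            __syncwarp();               // all writes done before the next read
        }
    }
}

// Hypothetical "bigger kernel": two reductions sharing one sdata array.
__global__ void twoPass(const float *in, float *out)
{
    __shared__ float sdata[BLOCK];
    unsigned int tid = threadIdx.x;

    for (int pass = 0; pass < 2; ++pass) {
        sdata[tid] = in[pass * BLOCK + tid];
        __syncthreads();                 // all inputs are in shared memory
        blockSum(sdata, tid);
        if (tid == 0) out[pass] = sdata[0];
        __syncthreads();                 // the extra barrier: no thread reloads
                                         // sdata until warp 0 has finished
    }
}

// Host helper: returns true if both block sums come back as expected.
bool runTwoPass()
{
    float h_in[2 * BLOCK], h_out[2] = {0.0f, 0.0f};
    for (int i = 0; i < BLOCK; ++i) { h_in[i] = 1.0f; h_in[BLOCK + i] = 2.0f; }

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) { cudaFree(d_in); return false; }

    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    twoPass<<<1, BLOCK>>>(d_in, d_out);
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess
        && h_out[0] == 1.0f * BLOCK
        && h_out[1] == 2.0f * BLOCK;
}

int main()
{
    printf("%s\n", runTwoPass() ? "sums match" : "sums differ or CUDA error");
    return 0;
}
```

Without the second __syncthreads() in twoPass, threads 32 to 63 could load their pass 1 inputs into sdata while warp 0 is still reading sdata[32..63] for pass 0, which is exactly the overwrite I am worried about.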

As a separate issue, the manual (B.2.4) talks about volatile, but reduce6 does not use it.
Am I right in assuming it is not needed?

Many thanks
PS: CUDA 2.3, driver 190.42-r2, Linux st127 2.6.29-gentoo-r5

    Dr. W. B. Langdon, 
    Department of Computer Science, 
    King's College London,
    Strand, London, WC2R 2LS, UK

FOGA 2011
CIGPU 2010
A Field Guide to Genetic Programming
GP Bibliography

ps: I forgot to say my comment was inspired by