That will not work: multiple threads read and write sum at the same time with no synchronization, which is a race condition and leads to undefined results.
For an example of how to do this, look at the scalarProd example or the scan example in the SDK. In particular, the scan whitepaper has a well-written description: you only need the upsweep phase.
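To make the idea concrete, here is a minimal sketch of a tree-style sum in shared memory. It is not the SDK code: the names blockSum, deviceSum and BLOCK are made up for illustration, the indexing is the simpler "halve the stride" variant rather than the exact pattern in the whitepaper, and error checking is left out. It assumes the block size is a power of two.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical block size for this sketch; must be a power of two.
constexpr int BLOCK = 256;

// Each block reduces up to BLOCK input elements to one partial sum.
__global__ void blockSum(const float* in, float* out, int n) {
    __shared__ float buf[BLOCK];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // Every thread loads one element (0 for out-of-range indices).
    buf[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction: at each step the lower half of the active threads
    // adds in the value held by the upper half. No two threads ever
    // write the same location, so there is no race.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            buf[tid] += buf[tid + stride];
        }
        __syncthreads();  // finish this level before starting the next
    }

    // Thread 0 holds the block's total.
    if (tid == 0) {
        out[blockIdx.x] = buf[0];
    }
}

// Host helper (illustrative): sums n floats, adding the per-block
// partial sums on the CPU. CUDA error checking is omitted for brevity.
float deviceSum(const float* hostIn, int n) {
    if (n <= 0) return 0.0f;
    int blocks = (n + BLOCK - 1) / BLOCK;

    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, blocks * sizeof(float));
    cudaMemcpy(dIn, hostIn, n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, BLOCK>>>(dIn, dOut, n);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dOut, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);

    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

The point is that each thread only ever writes its own slot in buf, and __syncthreads() separates the levels of the tree, so nothing like the shared sum variable is being updated by many threads at once. For large inputs you would probably run the kernel a second time on the partial sums instead of finishing on the CPU.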