Doesn't this write to the same thread?

Hello,

I saw this unrolling:

int tid = threadIdx.x;

sharedData[tid] += sharedData[tid + 32];
sharedData[tid] += sharedData[tid + 16];
sharedData[tid] += sharedData[tid + 8];
...

Doesn’t the right-hand side read locations that other threads are simultaneously writing as their own sharedData[tid]?

Thanks

Does this not cause a race condition / data contention?

Should you not use some synchronization (__syncthreads()), or at least read into local memory in a first step and write back to shared memory in a second step?
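For reference, the fully synchronized version of that tail would look something like this. This is only a minimal sketch, assuming sharedData is a __shared__ float array inside a kernel where every thread of the block reaches the barrier:

```
// Sketch: last steps of a block reduction with explicit barriers.
// The barrier sits outside the if so that ALL threads of the block
// reach it, which __syncthreads() requires.
for (unsigned int s = 32; s > 0; s >>= 1) {
    if (tid < s)
        sharedData[tid] += sharedData[tid + s];
    __syncthreads();  // make the writes of this step visible before the next
}
```

With the barrier after every step there is no race: no thread reads a location until the thread that last wrote it has passed the barrier.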

That’s what I am saying.

So, either you must add a __syncthreads() after every line, or you must do something like:

sharedData[tid + 32] += sharedData[tid + 32];
sharedData[tid + 16] += sharedData[tid + 16];
sharedData[tid + 8] += sharedData[tid + 8];

Right?

I found another similar example (I don’t remember where I saw it in the first place):
http://www.bu.edu/pasi/files/2011/07/Lecture5.pdf

On page 22, why does he have it like this?
On page 30 he has it right; he uses __syncthreads().

You don’t need a __syncthreads() at every line if you are in warp-synchronous mode and the sharedData pointer is declared volatile.

Please review the CUDA parallel reduction tutorial:

http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf

e.g. slides 21-22

Your suggestion, e.g.:

sharedData[tid + 32] += sharedData[tid + 32];

doesn’t make any sense in the context of a parallel reduction: it just doubles each element in place instead of combining pairs of elements.

You don’t need a __syncthreads() at every line if you are in warp-synchronous mode and the sharedData pointer is declared volatile.
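For completeness, the warp-synchronous variant being described is essentially the one from NVIDIA’s reduction slides. A sketch, assuming the earlier (inter-warp) steps of the reduction already used __syncthreads() and only the last 32 threads remain active:

```
// Sketch: warp-synchronous tail of a reduction (the old-style idiom).
// volatile forces every read and write to actually go to shared memory
// instead of being cached in registers, which is what made this appear
// to work without __syncthreads() on older toolkits.
__device__ void warpReduce(volatile float *sharedData, int tid) {
    sharedData[tid] += sharedData[tid + 32];
    sharedData[tid] += sharedData[tid + 16];
    sharedData[tid] += sharedData[tid + 8];
    sharedData[tid] += sharedData[tid + 4];
    sharedData[tid] += sharedData[tid + 2];
    sharedData[tid] += sharedData[tid + 1];
}

// Called from the kernel as:
//   if (tid < 32) warpReduce(sharedData, tid);
```

Note that this relies on all 32 threads of a warp executing in lockstep, which is an assumption the hardware no longer guarantees on newer architectures.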

So, if I don’t declare it as volatile, I need __syncthreads() on each line.

Can you tell me why, on page 22 of the link I gave above, he doesn’t use __syncthreads() or volatile?

It happened to work by pure chance with older versions of the CUDA toolkit. Since it worked, you would find the code without volatile in quite a few places - IIRC even in Nvidia’s own documentation. And then it suddenly broke with newer compilers…
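For anyone finding this thread later: on newer toolkits the usual fix is to drop the implicit warp-synchronous assumption entirely and reduce within the warp using shuffle intrinsics. A sketch, assuming CUDA 9+ (which introduced the *_sync variants) and a full warp of 32 participating lanes:

```
// Sketch: warp reduction with shuffles, no shared memory or volatile needed.
// 0xffffffff is the mask of participating lanes (full warp assumed here).
__device__ float warpReduceSum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;  // lane 0 ends up holding the warp's sum
}
```

Unlike the volatile idiom, the _sync intrinsics synchronize the named lanes explicitly, so this stays correct under the independent thread scheduling of Volta and later.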

Ok, thanks for the info!