I’m adding the elements of a big array into a single output. Given the size of the array, I need several blocks to compute the sum.
Right now I’m executing the blocks in series, so I can guarantee that there are no simultaneous writes to the output.
Since I’m currently launching a single block at a time, my code looks something like this:
if (threadIdx.x == 0) *out += *shared_out;
My question is: Can I use __threadfence() to ensure there are no overlapping writes between two different blocks? Something like this:
if (threadIdx.x == 0) {
    *out += *shared_out;
    __threadfence();
}
Oh __threadfence(), the most confusing instruction or intrinsic to probably ever appear anywhere.
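It’s a memory fence, not a synchronization primitive, so no, it won’t do what you want here: *out += *shared_out is still a separate read, add, and write, and two blocks can interleave those steps and lose an update. However, if you did have another synchronization primitive and you didn’t want to use atomics, you would have to call __threadfence() before releasing the primitive to ensure that the write is visible to the block that becomes active next.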
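To make the atomics alternative mentioned above concrete, here is a minimal sketch. It assumes float data and 512 threads per block; the names sum_atomic, sum_with_atomics and kBlockSize are made up for illustration and are not from the original code. Each block reduces its slice in shared memory, and only thread 0 touches *out, using atomicAdd instead of a plain +=.

```cuda
#include <cuda_runtime.h>

constexpr int kBlockSize = 512;  // assumed block size; must be a power of two

// Hypothetical kernel: per-block shared-memory reduction, then one atomicAdd.
__global__ void sum_atomic(const float* in, int n, float* out)
{
    __shared__ float partial[kBlockSize];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // pad the last block with zeros
    __syncthreads();

    // Tree reduction inside the block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    // The read-modify-write on *out is indivisible, so blocks running
    // at the same time cannot lose each other's updates.
    if (threadIdx.x == 0)
        atomicAdd(out, partial[0]);
}

// Hypothetical host helper: copies the input over, launches, returns the sum.
float sum_with_atomics(const float* h_in, int n)
{
    const int blocks = (n + kBlockSize - 1) / kBlockSize;
    float* d_in = nullptr;
    float* d_out = nullptr;
    float h_out = 0.0f;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));  // the accumulator must start at zero
    sum_atomic<<<blocks, kBlockSize>>>(d_in, n, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

With 20000 elements this launches 40 blocks in one go, so only 40 atomic operations ever hit *out. Error checking is left out to keep the sketch short.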
I have never worked with atomics. Could you please post a simple example for this situation?
Also, I have read somewhere that atomic operations are slow. I have to sum around 20000 elements (about 40 blocks of 512 threads). Would it be faster to write the 40 partial results into an array and then sum them up?
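For comparison, the array-of-partial-results idea could look roughly like the sketch below. It again assumes float data and 512 threads per block, and the names block_sums_kernel, sum_two_stage and kThreads are hypothetical. Every block writes to its own slot, so no atomics or fences are needed, and the final sum over the 40 values is done on the host after the kernel has finished. Which variant is faster is best settled by timing both, since with one write per block the difference may be small.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 512;  // assumed block size; must be a power of two

// Hypothetical kernel: each block stores its partial sum in its own slot.
__global__ void block_sums_kernel(const float* in, int n, float* block_sums)
{
    __shared__ float partial[kThreads];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // pad the last block with zeros
    __syncthreads();

    // Tree reduction inside the block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    // No two blocks share a slot, so a plain store is safe here.
    if (threadIdx.x == 0)
        block_sums[blockIdx.x] = partial[0];
}

// Hypothetical host helper: one kernel launch, then a short loop on the CPU.
float sum_two_stage(const float* h_in, int n)
{
    const int blocks = (n + kThreads - 1) / kThreads;
    float* d_in = nullptr;
    float* d_sums = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_sums, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    block_sums_kernel<<<blocks, kThreads>>>(d_in, n, d_sums);

    // This blocking copy on the default stream waits for the kernel to finish.
    std::vector<float> h_sums(blocks);
    cudaMemcpy(h_sums.data(), d_sums, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_sums);

    float total = 0.0f;
    for (float s : h_sums) total += s;  // second stage: sum the per-block results
    return total;
}
```

The end of the kernel launch acts as the synchronization point between the two stages, which is why this version needs neither atomicAdd nor __threadfence().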