Trying to understand memory fence function example

Hi all,

I recently browsed through the CUDA programming manual again and came across the memory fence function example on page 109.
It says:

“If no fence is placed between storing the partial sum and incrementing the counter, the counter might increment before the partial sum is stored and therefore, might reach gridDim.x-1 and let the last block start reading partial sums before they have been actually updated in memory.”

But I noticed there is a call to __syncthreads() before the last block starts calculating the total sum and section B.6 states that “__syncthreads() waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to __syncthreads() are visible to all threads in the block.”

So I was wondering how the last block could start reading partial sums before they have been updated in memory if no fence is placed between storing the partial sum and incrementing the counter, taking into account that all shared memory accesses are visible to all threads after the __syncthreads() call.

N.

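It looks like these are writes to GLOBAL memory across MULTIPLE BLOCKS. __syncthreads() only covers the threads of one block, so it says nothing about when a different block sees those writes. As I understand it from other posts on the forums, __threadfence() just ‘flushes’ all pending global memory writes before moving on, i.e. the calling thread's earlier global writes become visible to the other blocks before anything it writes afterwards.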
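For what it's worth, here is a rough sketch of the pattern as I read it, not the manual's exact code. The names blocksDone, sumKernel and sumOnDevice are made up for illustration, and the per-block sum is done serially by thread 0 just to keep it short. The point is the order in thread 0: store the partial sum, fence, then bump the counter. The __syncthreads() afterwards only hands the isLastBlockDone flag to the other threads of the same block.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical name: counts how many blocks have published their partial sum.
__device__ unsigned int blocksDone = 0;

__global__ void sumKernel(const float* in, unsigned int n,
                          volatile float* partial, float* total)
{
    __shared__ bool isLastBlockDone;

    if (threadIdx.x == 0) {
        // Thread 0 sums this block's slice serially (kept simple on purpose).
        unsigned int chunk = (n + gridDim.x - 1) / gridDim.x;
        unsigned int begin = blockIdx.x * chunk;
        unsigned int end   = min(begin + chunk, n);
        float s = 0.0f;
        for (unsigned int i = begin; i < end; ++i) s += in[i];

        partial[blockIdx.x] = s;   // 1. store the partial sum in global memory
        __threadfence();           // 2. make that store visible device-wide first
        unsigned int ticket = atomicInc(&blocksDone, gridDim.x);  // 3. then count
        isLastBlockDone = (ticket == gridDim.x - 1);
    }

    // Block-local barrier: shares isLastBlockDone within THIS block only.
    __syncthreads();

    if (isLastBlockDone && threadIdx.x == 0) {
        float t = 0.0f;
        for (unsigned int b = 0; b < gridDim.x; ++b) t += partial[b];
        *total = t;
        blocksDone = 0;            // reset so the kernel can be launched again
    }
}

// Hypothetical host helper: copies the data over, runs the kernel, returns the sum.
float sumOnDevice(const std::vector<float>& h, unsigned int blocks)
{
    assert(blocks > 0);
    unsigned int n = static_cast<unsigned int>(h.size());
    float *dIn = nullptr, *dPartial = nullptr, *dTotal = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dPartial, blocks * sizeof(float));
    cudaMalloc(&dTotal, sizeof(float));
    cudaMemcpy(dIn, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    sumKernel<<<blocks, 32>>>(dIn, n, dPartial, dTotal);

    float result = 0.0f;
    // A blocking cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&result, dTotal, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dPartial);
    cudaFree(dTotal);
    return result;
}
```

Without step 2, nothing stops another block from seeing the counter update before the partial[] store, and that is the case the manual is warning about. The __syncthreads() cannot help there because the writer and the reader are in different blocks.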

Ah, that makes sense. Thanks for clearing that up.

N.

What if I want to flush global writes too?