Trying to understand memory fence function example

Nico · June 23, 2009, 9:26am

Hi all,

I recently browsed through the CUDA programming manual again and came across the memory fence function example on page 109.
It says:

“If no fence is placed between storing the partial sum and incrementing the counter, the counter might increment before the partial sum is stored and therefore, might reach gridDim.x-1 and let the last block start reading partial sums before they have been actually updated in memory.”

But I noticed there is a call to __syncthreads() before the last block starts calculating the total sum and section B.6 states that “__syncthreads() waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to __syncthreads() are visible to all threads in the block.”

So I was wondering how the last block could start reading partial sums before they have been updated in memory if no fence is placed between storing the partial sum and incrementing the counter, taking into account that all shared memory accesses are visible to all threads after the __syncthreads() call.

N.

jph4599 · June 23, 2009, 11:56am

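It looks like these are writes to GLOBAL memory across MULTIPLE BLOCKS. __syncthreads() only synchronizes the threads within one block, so it tells the last block nothing about whether the other blocks' partial sums have landed in global memory yet. As I understand it from other posts on the forums, __threadfence() just ‘flushes’ all pending global memory writes before moving on, so the partial sum is visible to the other blocks before the counter is incremented.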
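
For reference, here is a minimal sketch of the pattern the manual's example describes, as far as I can reconstruct it. The block size `THREADS`, the grid-stride partial sum, the serial final sum, and the host helper `run_sum` are my own assumptions for illustration and are not taken from the manual; only the ordering (store the partial sum, `__threadfence()`, then `atomicInc` on the counter) is the part under discussion.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int THREADS = 256;        // assumed block size (power of two)

__device__ unsigned int count = 0;  // how many blocks have published a partial sum

__global__ void sum(const float* in, unsigned int n, volatile float* result)
{
    __shared__ float cache[THREADS];
    __shared__ bool isLastBlockDone;

    // Each thread accumulates a grid-strided slice of the input.
    float acc = 0.0f;
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        acc += in[i];
    cache[threadIdx.x] = acc;
    __syncthreads();

    // Block-local tree reduction in shared memory.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        result[blockIdx.x] = cache[0];  // publish this block's partial sum
        __threadfence();                // order the store before the counter update
        unsigned int ticket = atomicInc(&count, gridDim.x);
        isLastBlockDone = (ticket == gridDim.x - 1);
    }
    // Block-local only: lets every thread in THIS block read isLastBlockDone.
    __syncthreads();

    if (isLastBlockDone && threadIdx.x == 0) {
        // Only the last block to finish gets here; it adds up all partial sums.
        float total = 0.0f;
        for (unsigned int b = 0; b < gridDim.x; ++b)
            total += result[b];
        result[0] = total;
        count = 0;                      // reset so the kernel can be launched again
    }
}

// Hypothetical host helper: sums n ones on the GPU and returns the result.
float run_sum(unsigned int n)
{
    const int blocks = 32;              // assumed grid size
    std::vector<float> host(n, 1.0f);
    float* d_in = nullptr;
    float* d_result = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_result, blocks * sizeof(float));
    cudaMemcpy(d_in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    sum<<<blocks, THREADS>>>(d_in, n, d_result);

    float total = 0.0f;                 // the blocking copy also waits for the kernel
    cudaMemcpy(&total, d_result, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_result);
    return total;
}

int main()
{
    printf("total = %f\n", run_sum(1u << 20));
    return 0;
}
```

The `__syncthreads()` after the `if` only makes `isLastBlockDone` visible inside the one block that wrote it. Without the `__threadfence()`, nothing orders a block's store to `result[blockIdx.x]` ahead of its `atomicInc`, which is the situation the quoted passage from the manual warns about.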

Nico · June 23, 2009, 12:02pm

Ah, that makes sense. Thanks for clearing that up.

N.

tugrul_192bit · March 24, 2018, 9:45am

What if I want to flush global writes too?
