Synchronization, threadfence, random memory access beginner questions

Hi,
I am very new to parallel programming and want to write a CUDA program for my GTX 480 card.
My questions are:

  1. The first warp in a block reads and writes shared memory, but every address is used only once for a
    read-modify-write, i.e. there is no read-read or write-read, etc. The next warp does the same but
    reads and writes addresses already touched by the first warp (see the sketch after this list). As I
    understand it, there is implicit synchronization within a warp, so I do not need __syncthreads()
    (I do not care whether warp one operates before or after warp two). Is this correct?

  2. I don’t understand the difference between __syncthreads() and __threadfence(). Any good explanations (better than the programming guide)?

  3. I need to write to a 24 MB array A[i][j], but the access is completely random, i.e. the indices i and j
    are only determined at runtime. Moving the array back and forth to the CPU just to do a little math
    there seems very inefficient. Is there a way to do this on the GPU? I am fine with slow code as long
    as it is faster than the alternative on my 6-core CPU; I figure even inefficient use of the 15
    multiprocessors on the GPU should still be better. It sounds like texture memory doesn’t give much
    performance gain on Fermi cards and just looks too complicated (I like robust and easy code; a
    30-50% performance loss doesn’t worry me too much). Is this correct? Any ideas?
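
Here is a minimal sketch of the pattern in question 1 (names are made up; assume a block of 64 threads, i.e. two warps):

    __global__ void twoWarps(float *out)
    {
        __shared__ float buf[64];
        buf[threadIdx.x] = (float)threadIdx.x;  // initialize every slot
        __syncthreads();

        int lane = threadIdx.x % 32;
        if (threadIdx.x / 32 == 0)
            buf[lane] += 1.0f;    // warp 0: one read-modify-write per address
        // no barrier here -- this is what the question is about
        if (threadIdx.x / 32 == 1)
            buf[lane] *= 2.0f;    // warp 1: touches the addresses warp 0 wrote
        __syncthreads();
        out[threadIdx.x] = buf[threadIdx.x];
    }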

Thanks in advance

I’m confused: if you have the second warp reading and writing addresses used by the first warp, it sounds like you almost certainly have a race condition. Am I misunderstanding something?

Both are barriers, but as I understand it, __threadfence (and friends) are memory barriers designed to ensure that reads and writes from a thread are visible to other threads in a desired order. That is to say, it blocks a thread until its previous memory reads and writes are done, but does not stop other threads, who may be executing instructions before or after the fence point in general. __syncthreads() is an actual execution barrier, forcing all threads in a block to wait until all other threads in the block have arrived at the sync point.

It is a subtle difference, and you almost certainly want __syncthreads() to avoid read-after-write hazards in shared memory.
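
To illustrate with a minimal sketch (the 256-element tile and the assumption that the block has 256 threads are mine):

    __global__ void staged(const float *in, float *out)
    {
        __shared__ float tile[256];
        tile[threadIdx.x] = in[threadIdx.x];        // each thread writes one slot
        __syncthreads();                            // execution barrier: all writes have happened
        out[threadIdx.x] = tile[255 - threadIdx.x]; // now safe to read another thread's slot
    }

Without the __syncthreads(), the read on the last line could see a stale value for a slot written by a thread in another warp.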

Since you have a GTX 480, your best bet in this case is to just read and write the memory locations you want directly, and hope the L2 cache helps you a little bit.
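
Something along these lines (a sketch only; the index arrays, pitch, and n are assumptions about how your data arrives):

    // A is the 24 MB array flattened to 1D with row pitch 'pitch';
    // i_idx/j_idx hold the runtime-computed indices.
    __global__ void scatter(float *A, const int *i_idx, const int *j_idx,
                            const float *val, int n, int pitch)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t < n)
            A[i_idx[t] * pitch + j_idx[t]] += val[t]; // uncoalesced, but L2 may absorb some of the cost
    }

If two threads can land on the same (i, j), replace the += with atomicAdd(&A[i_idx[t] * pitch + j_idx[t]], val[t]).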

Sorry, you’re right, this won’t work. What I actually have in one warp is: thread 1 modifies variable A in shared memory, thread 2 modifies variable B, but later in the kernel thread 1 also accesses B. I want to make sure that the changes to B are visible to thread 1 when it gets there. Would __threadfence() be needed here? __syncthreads() seems too strong to me, since it would delay all the other warps in the same block even though they have no dependencies, i.e. they only deal with their own pieces of memory.
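
Here is a minimal sketch of the pattern (all names and values are made up):

    __global__ void myKernel(float *out)
    {
        __shared__ float A, B;
        if (threadIdx.x == 0) A = 1.0f;    // thread 1 modifies A
        if (threadIdx.x == 1) B = 2.0f;    // thread 2 modifies B
        // ... later in the kernel ...
        if (threadIdx.x == 0) out[0] = B;  // thread 1 must see the new value of B
    }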

Thanks a lot for your help so far.

If thread1 and thread2 are always in the same warp, you won’t need anything, but if they are in different warps, you need __syncthreads().

Edit: And for the “in the same warp” case, you want to leave a big warning for whoever reads the code about what is going on. The implicit warp synchronization is very non-obvious when reading CUDA code. You might even put the unnecessary __syncthreads() in to defend against future changes that would break that assumption, assuming the penalty is small.

__threadfence() is almost never the answer. :)
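
For your A/B example, a sketch of what that would look like (names made up):

    __global__ void myKernel(float *out)
    {
        __shared__ float A, B;
        if (threadIdx.x == 0) A = 1.0f;    // thread 1 writes A
        if (threadIdx.x == 1) B = 2.0f;    // thread 2 writes B
        // WARNING: correctness relies on implicit warp synchronization;
        // threads 0 and 1 must stay in the same warp.
        __syncthreads();                   // redundant today, but guards against future changes
        if (threadIdx.x == 0) out[0] = B;  // thread 1 reads B
    }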

Is there a way to handle a read-after-write hazard on global memory among all threads of a kernel?

How do I handle a situation where I have a work list, a set of consumer threads, and a set of producer threads? The consumer threads spin on the (head == tail) condition. When a producer thread block writes (adds) something to the worklist, one thread at the end updates the tail to the new value with atomicAdd() (after all threads of the block have issued their write commands and are waiting at __syncthreads()).

How do I make the consumer threads wait for the producer threads’ writes to commit before starting to read them?
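
Here is a rough sketch of the producer side I have in mind (worklist, head, tail, and the sizes are made-up names, and the slot reservation is simplified):

    #define MAX_ITEMS 4096

    __device__ int head;                  // consumers advance this
    __device__ int tail;                  // producers advance this
    __device__ int worklist[MAX_ITEMS];

    __global__ void producer(const int *items)
    {
        // Simplified: a real version must atomically reserve its slot range.
        int base = tail;
        worklist[base + threadIdx.x] = items[blockIdx.x * blockDim.x + threadIdx.x];
        __syncthreads();                  // all threads of the block have issued their writes
        __threadfence();                  // is this what makes the writes visible
                                          // before consumers see the new tail?
        if (threadIdx.x == 0)
            atomicAdd(&tail, blockDim.x); // publish the new items
    }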