Synchronization, threadfence, random memory access beginner questions

Hi,
I am very new to parallel programming and want to write a CUDA program for my GTX 480 card.
My questions are:

  1. The first warp in a block reads and writes shared memory, but every address is used only once for a
    read-modify-write, i.e. there is no read-read or write-read, etc. The next warp does the same but
    reads and writes addresses already touched by the first warp (see the sketch after this list). As I
    understand it, there is implicit synchronization within a warp, so I do not need __syncthreads()
    (I do not care whether warp one operates before or after warp two). Is this correct?

  2. I don’t understand the difference between __syncthreads() and __threadfence(). Any good explanations (better than the programming guide)?

  3. I need to write to a 24 MB array A[i][j], but the access is completely random, i.e. the indices i and j
    are only determined at runtime. Moving the array back and forth to the CPU just to do a little math
    there seems very inefficient. Is there a way to do this on the GPU? I am fine with slow code as long
    as it is faster than the alternative on my 6-core CPU; I figure even inefficient use of the 15
    multiprocessors on the GPU should still be better. It sounds like texture memory doesn’t give much
    performance gain on Fermi cards and just looks too complicated (I like robust and easy code; a
    30-50% performance loss doesn’t worry me too much). Is this correct? Any ideas?
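
Here is a minimal sketch of the pattern in question 1 (names are made up; assume a block of 64 threads, i.e. two warps):

    __global__ void twoWarps(float *out)
    {
        __shared__ float buf[64];
        buf[threadIdx.x] = (float)threadIdx.x;  // initialize every slot
        __syncthreads();

        int lane = threadIdx.x % 32;
        if (threadIdx.x / 32 == 0)
            buf[lane] += 1.0f;    // warp 0: one read-modify-write per address
        // no barrier here -- this is what the question is about
        if (threadIdx.x / 32 == 1)
            buf[lane] *= 2.0f;    // warp 1: touches the addresses warp 0 wrote
        __syncthreads();
        out[threadIdx.x] = buf[threadIdx.x];
    }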

Thanks in advance

I’m confused: if you have the second warp reading and writing addresses used by the first warp, it sounds like you almost certainly have a race condition. Am I misunderstanding something?

Both are barriers, but as I understand it, __threadfence (and friends) are memory barriers designed to ensure that reads and writes from a thread are visible to other threads in a desired order. That is to say, it blocks a thread until its previous memory reads and writes are done, but does not stop other threads, who may be executing instructions before or after the fence point in general. __syncthreads() is an actual execution barrier, forcing all threads in a block to wait until all other threads in the block have arrived at the sync point.

It is a subtle difference, and you almost certainly want __syncthreads() to avoid read-after-write hazards in shared memory.
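
To illustrate with a minimal sketch (the 256-element tile and the assumption that the block has 256 threads are mine):

    __global__ void staged(const float *in, float *out)
    {
        __shared__ float tile[256];
        tile[threadIdx.x] = in[threadIdx.x];        // each thread writes one slot
        __syncthreads();                            // execution barrier: all writes have happened
        out[threadIdx.x] = tile[255 - threadIdx.x]; // now safe to read another thread's slot
    }

Without the __syncthreads(), the read on the last line could see a stale value for a slot written by a thread in another warp.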

Since you have a GTX 480, your best bet in this case is to just read and write the memory locations you want directly, and hope the L2 cache helps you a little bit.
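
Something along these lines (a sketch only; the index arrays, pitch, and n are assumptions about how your data arrives):

    // A is the 24 MB array flattened to 1D with row pitch 'pitch';
    // i_idx/j_idx hold the runtime-computed indices.
    __global__ void scatter(float *A, const int *i_idx, const int *j_idx,
                            const float *val, int n, int pitch)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t < n)
            A[i_idx[t] * pitch + j_idx[t]] += val[t]; // uncoalesced, but L2 may absorb some of the cost
    }

If two threads can land on the same (i, j), replace the += with atomicAdd(&A[i_idx[t] * pitch + j_idx[t]], val[t]).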

Sorry, you’re right, this won’t work. What I actually have in one warp is: thread 1 modifies variable A in shared memory, thread 2 modifies variable B, but later in the kernel thread 1 also accesses B. I want to make sure that the changes to B are visible to thread 1 when it gets there. Would __threadfence() be needed here? __syncthreads() seems too strong to me, since it would delay all the other warps in the same block even though they have no dependencies, i.e. they only deal with their own pieces of memory.
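
Here is a minimal sketch of the pattern (all names and values are made up):

    __global__ void myKernel(float *out)
    {
        __shared__ float A, B;
        if (threadIdx.x == 0) A = 1.0f;    // thread 1 modifies A
        if (threadIdx.x == 1) B = 2.0f;    // thread 2 modifies B
        // ... later in the kernel ...
        if (threadIdx.x == 0) out[0] = B;  // thread 1 must see the new value of B
    }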

Thanks a lot for your help so far.

If thread1 and thread2 are always in the same warp, you won’t need anything, but if they are in different warps, you need __syncthreads().

Edit: And for the “in the same warp” case, you want to leave a big warning for whoever reads the code about what is going on. The implicit warp synchronization is very non-obvious when reading CUDA code. You might even put the unnecessary __syncthreads() in to defend against future changes that would break that assumption, assuming the penalty is small.

__threadfence() is almost never the answer. :)
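
For your A/B example, a sketch of what that would look like (names made up):

    __global__ void myKernel(float *out)
    {
        __shared__ float A, B;
        if (threadIdx.x == 0) A = 1.0f;    // thread 1 writes A
        if (threadIdx.x == 1) B = 2.0f;    // thread 2 writes B
        // WARNING: correctness relies on implicit warp synchronization;
        // threads 0 and 1 must stay in the same warp.
        __syncthreads();                   // redundant today, but guards against future changes
        if (threadIdx.x == 0) out[0] = B;  // thread 1 reads B
    }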

Is there a way to handle a read-after-write hazard on global memory among all threads of a kernel?

How do I handle a situation where I have a work list, a set of consumer threads, and a set of producer threads? The consumer threads spin on the (head == tail) condition. When a producer thread block writes (adds) something to the worklist, one thread at the end updates the tail to the new value with atomicAdd() (after all threads of the block have issued their write commands and are waiting at __syncthreads()).

How do I make the consumer threads wait for the producer threads’ writes to commit before starting to read them?
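
Here is a rough sketch of the producer side I have in mind (worklist, head, tail, and the sizes are made-up names, and the slot reservation is simplified):

    #define MAX_ITEMS 4096

    __device__ int head;                  // consumers advance this
    __device__ int tail;                  // producers advance this
    __device__ int worklist[MAX_ITEMS];

    __global__ void producer(const int *items)
    {
        // Simplified: a real version must atomically reserve its slot range.
        int base = tail;
        worklist[base + threadIdx.x] = items[blockIdx.x * blockDim.x + threadIdx.x];
        __syncthreads();                  // all threads of the block have issued their writes
        __threadfence();                  // is this what makes the writes visible
                                          // before consumers see the new tail?
        if (threadIdx.x == 0)
            atomicAdd(&tail, blockDim.x); // publish the new items
    }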