__syncthreads thread synchronization

Hi all,
I am working with CUDA and trying to understand __syncthreads. I cannot see any difference whether I use __syncthreads or not.
Please give me an example that uses __syncthreads and explain how it works.
Thank you.

__syncthreads will wait for all warps in a block to reach that point in your code. Say you are processing audio, with each thread handling its own little detail, and have produced the raw data for one sample. You would then call __syncthreads before the mixdown to the final reduced format (stereo, 7.1 …) to ensure that everybody is on the same page. Without it, the different warps will run ahead of or behind each other, resulting in a noisy mix of partially unrelated data.

Please give me an example of it.

See the matrix multiplication example on page 25 of the Programming Guide 2.3.

More generally, __syncthreads() is a barrier primitive designed to protect you from read-after-write memory race conditions within a block.

The rules of use are pretty simple:

  1. Put a __syncthreads() after the write and before the read when there is a possibility of a thread reading a memory location that another thread has written to.

  2. __syncthreads() is only a barrier within a block, so it cannot protect you from read-after-write race conditions in global memory unless the only possible conflict is between threads in the same block. __syncthreads() is pretty much always used to protect shared memory read-after-write.

  3. Do not use a __syncthreads() call in a branch or a loop unless you are sure every single thread will reach the same __syncthreads() call. This can sometimes require that you break your if-blocks into several pieces to put __syncthreads() calls at the top level, where all threads (including those which failed the if predicate) will execute them.

  4. When looking for read-after-write situations in loops, it helps to unroll the loop in your head when figuring out where to put __syncthreads() calls. For example, you often need an extra __syncthreads() call at the end of the loop if there are reads and writes from different threads to the same shared memory location in the loop.

  5. __syncthreads() does not mark a critical section, so don’t use it like that.

  6. Do not put a __syncthreads() at the end of a kernel call. There’s no need for it.

  7. Many kernels do not need __syncthreads() at all because two different threads never access the same memory location.
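Rules 1, 3 and 4 can be sketched in one small kernel. This is only an illustration under stated assumptions, not code from the programming guide: the kernel name neighbour_sum, the sizes TILE and COUNT, and the host helper run_neighbour_sum are made up for the example, it assumes a single-block launch, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>

#define TILE 256    // threads per block (assumed for this sketch)
#define COUNT 1000  // element count, deliberately not a multiple of TILE

// Hypothetical kernel: out[i] = in[i] + the element to its right within the
// same tile (wrapping inside the tile, zero past the end of the input).
__global__ void neighbour_sum(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];
    int t = threadIdx.x;

    // The loop bound depends only on n, so every thread in the block runs
    // the same number of iterations and reaches the same barriers (rule 3).
    for (int base = 0; base < n; base += TILE) {
        int i = base + t;

        // Predicate the memory access, not the barrier (rule 3).
        tile[t] = (i < n) ? in[i] : 0.0f;
        __syncthreads();  // rule 1: after the write, before the read

        if (i < n)
            out[i] = tile[t] + tile[(t + 1) % TILE];  // reads another thread's write
        __syncthreads();  // rule 4: the next iteration overwrites tile[]
    }
}

// Hypothetical host helper: runs the kernel once with one block and
// returns true if every output matches a CPU reference.
bool run_neighbour_sum()
{
    static float h_in[COUNT], h_out[COUNT];
    for (int i = 0; i < COUNT; ++i) h_in[i] = (float)i;

    float *d_in = 0, *d_out = 0;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    neighbour_sum<<<1, TILE>>>(d_in, d_out, COUNT);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    for (int i = 0; i < COUNT; ++i) {
        int base = (i / TILE) * TILE;          // start of the tile holding i
        int j = base + (i - base + 1) % TILE;  // right neighbour inside that tile
        float expected = h_in[i] + (j < COUNT ? h_in[j] : 0.0f);
        if (h_out[i] != expected) return false;
    }
    return true;
}
```

Calling run_neighbour_sum() from a main() and compiling with nvcc should be enough to try it. Removing either barrier turns the kernel into a race, although the wrong results may not show up on every run.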

2.1. Threads within a warp are executed together and are implicitly synced.

Or is that a slightly too far reaching statement? (I have arranged for a few instructions between writes and read to shared within a warp, to avoid getting overly concerned about this.)

All threads of a block (say 512 threads) load data from memory into shared memory.

Now thread 0 adds its data to the data fetched by thread 511…
Thread 1 adds its data to that of thread 510, and so on up to thread 255 (which pairs with thread 256)…
Threads 256 to 511 just fetch data and do not participate in the addition…

Now, if you don’t add __syncthreads between the load and the calculation, you will get bad results…

Prototype this and see it for yourself.
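A minimal prototype of that experiment might look like the following. The names mirror_add and run_mirror_add and the input values are assumptions made for this sketch, it launches a single block of 512 threads, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

#define N 512  // threads per block, as in the description above

// Every thread loads one element into shared memory. The lower half of the
// block then adds the element loaded by its mirror thread
// (0 with 511, 1 with 510, ..., 255 with 256).
__global__ void mirror_add(const float *in, float *out)
{
    __shared__ float buf[N];
    int t = threadIdx.x;

    buf[t] = in[t];   // write: each thread stores its own element
    __syncthreads();  // barrier: all 512 stores are complete past this point

    if (t < N / 2)    // threads 256..511 only fetched data
        out[t] = buf[t] + buf[N - 1 - t];  // read an element another thread wrote
}

// Hypothetical host helper: runs the kernel once and returns true if all
// 256 sums have the expected value.
bool run_mirror_add()
{
    float h_in[N], h_out[N / 2];
    for (int i = 0; i < N; ++i) h_in[i] = (float)i;

    float *d_in = 0, *d_out = 0;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    mirror_add<<<1, N>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    // With in[i] = i, every pair sums to i + (511 - i) = 511.
    for (int i = 0; i < N / 2; ++i)
        if (h_out[i] != (float)(N - 1)) return false;
    return true;
}
```

To see the effect, comment out the __syncthreads() line and run it again. Threads in the first warps may then read buf entries that later warps have not yet written, though a race like this is not guaranteed to fail on every run.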

Oh right, forgot about that… That’s true, though I’ve never relied on this behavior myself. Read-after-write at the warp level is supposed to be safe automatically. (Although I’m curious how that is accomplished when the scheduling unit in hardware seems to be the half-warp. There are some subtleties there I don’t quite understand.)