__syncthreads question

Hello All,

I read about the use of __syncthreads() and have been using it, but I still have some doubts and need clarification for my understanding.

Q1. If I have a shared variable, say __shared__ float x;, and want to assign it some computed result, then __syncthreads() is used to keep it consistent across all threads.

Suppose I have two shared variables, __shared__ float x; and __shared__ float y;,

and then have two consecutive statements:

x = 25 * floorf( PI * z ); // ... and so on, followed by
y = 23 * floorf(PI * sin(angle));

Then, should I be putting __syncthreads() after each statement, or only after the last statement? As in:

x = 25 * floorf( PI * z );
__syncthreads();
y = 23 * floorf(PI * sin(angle));
__syncthreads();

OR

x = 25 * floorf( PI * z );
y = 23 * floorf(PI * sin(angle));
__syncthreads();

You need to insert a __syncthreads() at some point between a write to a shared memory location and a read of that shared memory location by another thread in the block. There’s no problem if you’re doing a lot of writes in a row without inserting __syncthreads() after every write.
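For example, with two consecutive shared writes, one barrier after the last write is enough (a minimal sketch; PI, z and angle are assumed to be defined elsewhere, and only thread 0 writes, since all threads writing the same variables would be the race described below):

__shared__ float x, y;

if (threadIdx.x == 0)
{
    x = 25 * floorf(PI * z);           // first shared write
    y = 23 * floorf(PI * sinf(angle)); // second shared write, no barrier needed in between
}

__syncthreads(); // one barrier between the writes and the reads

float r = x + y; // now every thread can safely read x and y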

Let’s start by noticing that you have a huge race condition - all threads try to write to the same variable.

Only a single thread should write to a single variable/address (even if it's shared).

__syncthreads() can be used in these situations:

  1. You have a shared array to which threads write and then read, like:

#define blockSize 256

__shared__ int sArray[blockSize];

sArray[threadIdx.x] = a;
int b = sArray[blockSize - 1 - threadIdx.x];

So, thread 0 will write its value a into sArray[0] and then try to read sArray[255]. We need to make sure that thread #255 has written its value a to this address and we do that by setting a barrier after a write and before a read:

sArray[threadIdx.x] = a;
__syncthreads();
int b = sArray[blockSize - 1 - threadIdx.x];

Now we are sure that all threads have written to their respective addresses in sArray (sArray is fully populated once the barrier is passed) before they start reading.

If there are multiple writes, you should sync before the first read.
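Put together as a complete kernel (a minimal sketch; the kernel name reverseRead and the in/out arrays are made up for illustration, and it assumes a launch of a single block of blockSize threads):

#define blockSize 256

__global__ void reverseRead(const int *in, int *out)
{
    __shared__ int sArray[blockSize];

    int a = in[threadIdx.x];
    sArray[threadIdx.x] = a; // each thread writes only its own slot: no race
    __syncthreads();         // barrier between the writes and the cross-thread reads
    out[threadIdx.x] = sArray[blockSize - 1 - threadIdx.x];
}

// launched as, e.g.: reverseRead<<<1, blockSize>>>(d_in, d_out);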

  2. One of the threads may set a flag and we don't want the others to miss it:
__shared__ bool flag;

if (threadIdx.x == 0)
    flag = false; // __shared__ variables can't be initialized in their declaration

__syncthreads(); // make sure every thread sees flag == false

if ( /* a condition usually met by only one or a few threads of the block */ )
    flag = true;

__syncthreads(); // remember, not within a conditional block!

if (flag)
    ; // do sth

__syncthreads() here makes sure that all threads wait before checking the flag until every other thread has gotten past the instructions that may have set it.

These two cases demonstrate using barriers for shared arrays and shared variables.

In both cases, the barrier goes after the writes to shared memory and before the reads of those values by other threads.

Thank you tmurray and Big_Mac.

Q2. What about this:

__shared__ float x;
__shared__ float z[10];
__shared__ float y;
__shared__ int w;

x = 25 * floorf(__cosf(angle));
y = 23 * floorf(__sinf(angle));
z[k++] = x * y;

for (i = 0; i < 10; i++)
{
    if (z[i] > 0.5f)
    {
        w += 1;
    }
}

Assume the above is a portion of code inside a loop, where the variable 'angle' is a changing input and 'k' is a counter.

In this case, will it be like this:

__shared__ float x;
__shared__ float z[10];
__shared__ float y;
__shared__ int w;

x = 25 * floorf(__cosf(angle));
y = 23 * floorf(__sinf(angle));

__syncthreads(); /////////////////////// As the above two are independent

z[k++] = x * y;

for (i = 0; i < 10; i++)
{
    if (z[i] > 0.5f)
    {
        w += 1;
        __syncthreads();
    }
    __syncthreads();
}

Guidance appreciated.

You still have a huge race and I don’t see where you’re going with this.

See, each and every thread of the grid will execute this code.
__shared__ variables are per-block, so all threads within a block see the same x, y, z and w. If you have 256 threads in a block, each of them will try to write to and read from the single variables x, y and w. In the case of z, it's an array, but every thread indexes it through the same counter k, so it boils down to the same thing: all threads try to access the same addresses.

Answer me this - why do you use shared memory in this code? What’s its purpose?

I am using shared memory to do computations. I have only put an example here, but the guide says that shared memory is fast, so I thought I would shift complex computations into shared memory for a speedup.

Please correct me if I am wrong in my approach. Thanks.

Just use regular private variables. Whatever would end up on the stack in normal C (a declaration like "int a") usually goes into registers and is as fast as or faster than shared memory. Such variables are thread-local, and there's never any race there.
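For instance, the computation from your Q2 with plain thread-local variables (a minimal sketch; angle is assumed to be a per-thread input):

float x = 25 * floorf(__cosf(angle)); // lives in a register, private to this thread
float y = 23 * floorf(__sinf(angle));
float p = x * y;                      // no shared memory, no race, no __syncthreads() needed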

I see.

Q1. So do you mean that if I use external arrays and variables, then I should use shared variables?

I read the guide and also agree with your earlier post in this thread, where you illustrated the use of shared variables. However, the guide is not clear about the context in which to use shared variables. The reason I am stressing shared memory is that the guide talks about the expensive delays associated with local and global memory accesses.

Q2. A for-loop has a condition too, and all threads must agree on it. Does that mean __syncthreads() must be used after the for-loop body? Is that right?

My main aim is to speed up the code by shifting all computations to shared memory. My question and discussion are not related to the code I posted; it is just for my clarification. The code I am working on is complex, but I cannot post it due to ownership reasons. Thanks a million.

A1. Accesses to global memory have high latency and limited bandwidth. But to populate your shared memory buffers, you would also have to do a global memory fetch at least once.

So, this is bad:

__global__ void kernel(float a[]) // a[] is in global memory
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x; // standard global thread index

    for (int i = 0; i < 100; ++i)
        a[idx]++;
}

Because there are 100 memory accesses happening in the loop (200 if we count reads and writes separately). Each iteration is "go fetch an element from the 'a' array, increment it, store it back".

This is good:

__global__ void kernel(float a[])
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    float myA = a[idx]; // this goes into a register

    for (int i = 0; i < 100; ++i)
        myA++;

    a[idx] = myA; // write back
}

Now you only have two global memory accesses. The data stays on-chip, not in shared memory but in registers.

What you were meaning to do was:

__global__ void kernel(float a[])
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    __shared__ float sharedA[blockSize]; // blockSize as #defined earlier

    sharedA[threadIdx.x] = a[idx]; // this goes into a shared array

    for (int i = 0; i < 100; ++i)
        sharedA[threadIdx.x]++;

    a[idx] = sharedA[threadIdx.x]; // write back
}

There are three things to note here:

  1. If you want each thread to use shared memory as its personal scratchpad-like memory, you need to allocate a shared array that's as big as your thread block. Remember, shared variables work differently than normal "float a" or "int b". Normal "float a" or "int b" variables end up being thread-local, that is, each thread that executes the kernel allocates them privately (usually in registers, unless they spill). When you put an innocent __shared__ in front of them, they become block-local, meaning every thread in a block will now refer to the same physical shared variable. Which means that if I took my second code example and put __shared__ in front of myA, this poor single variable would try to service all threads in the block, in a massive race condition (a sketch of that broken variant follows the picture). Without __shared__, myA is local to each thread. Here, I'll draw you a picture:

[attached picture: per-thread private registers vs. a single block-shared variable]
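And as the promised sketch of the broken variant (my second example with __shared__ mistakenly put in front of myA):

__shared__ float myA; // ONE variable for the entire block

myA = a[idx];         // every thread writes to the same address: race

for (int i = 0; i < 100; ++i)
    myA++;            // all threads increment the same location: race

a[idx] = myA;         // each thread reads back whatever value won the races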

  2. Private variables in registers are never slower than shared memory. Which means that the second solution I posted is just as fast as the third, if not faster (because not using shared memory reduces the resource requirements of a block and therefore potentially increases occupancy). Besides, IIRC the hardware can't actually work on shared memory directly, i.e. if you did

sharedA[0] = logf(sharedA[0]);

it would end up as "fetch the contents of sharedA[0] into a register, do the log on that register, store the result back into sharedA[0]". Not a performance killer, but an extra clock tick.
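Roughly, that one line behaves like this sketch:

float tmp = sharedA[0]; // load from shared memory into a register
tmp = logf(tmp);        // operate on the register
sharedA[0] = tmp;       // store the result back to shared memory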

  3. If there's no actual sharing of data going on, i.e. the threads do not communicate through your shared memory block, then that's a good hint that there's no need to use it, and you should just rely on registers unless you really know what you're doing.

A2. The hardware takes care of control divergence, i.e. if-else branches and early loop breaks; you do not need __syncthreads() just because threads evaluate a branch or loop condition differently.

__syncthreads() is used exclusively in conjunction with shared memory and even then only when there’s real sharing or communication.
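One caveat worth repeating from the "not within a conditional block" remark earlier: when a __syncthreads() is needed, every thread of the block must reach it, so keep it out of divergent branches. A sketch based on the Q2 loop (using CUDA's atomicAdd to make the shared counter update race-free):

// Wrong: only threads with z[i] > 0.5f reach the barrier, which can hang the block
if (z[i] > 0.5f)
{
    w += 1;
    __syncthreads();
}

// Right: do the conditional work first, then let all threads hit the same barrier
if (z[i] > 0.5f)
    atomicAdd(&w, 1); // race-free increment of the __shared__ int counter
__syncthreads();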

Wow, marvelous reply. That is quite an eye opener. The picture is very helpful. Thank you very much, Big_Mac, for the detailed reply. :)