Semantics of __syncthreads

Hi

I’m looking for a more detailed description of the semantics of `__syncthreads()`. Does anyone have a good source of information about that? I think the information in the
“programming guide” is a little short.

thank you and have a nice day

What more do you want? __syncthreads() is your garden-variety thread barrier. Any thread reaching the barrier waits until all of the other threads in that block also reach it. It is designed to avoid race conditions when loading shared memory, and the compiler will not move memory reads/writes across a __syncthreads().

It is nothing more and nothing less. Unless you are writing to a shared memory location in thread i and then reading that same location in thread j, you don’t need __syncthreads().
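A minimal sketch of that write-then-read pattern (hypothetical kernel, not from this thread; it assumes a block size of 256 threads):

```cuda
// Hypothetical kernel: each thread writes its own shared-memory slot,
// then reads a *neighbour's* slot. Without the barrier between the
// write and the read, the read could happen before the neighbour's
// write has completed.
__global__ void shiftKernel(int *out)
{
    __shared__ int buf[256];          // assumes blockDim.x == 256
    unsigned int tid = threadIdx.x;

    buf[tid] = tid;                   // write by thread tid
    __syncthreads();                  // all writes finish before any read
    out[tid] = buf[(tid + 1) % blockDim.x];  // read another thread's slot
}
```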

For example, the manual states that __syncthreads() is allowed in conditional code, but only if the conditional evaluates identically across the entire thread block.

This would mean that it’s not allowed to divide the threads of a block into two groups like this:

if (tid < n/2) {

__syncthreads();

}

else {

__syncthreads();

}

No, you cannot do that.
__syncthreads() waits for all of the threads in a block.

Right, the programming guide states that pretty clearly. It follows directly from the fact that __syncthreads() is a barrier that waits for all other threads in the block, as I said before.
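To make the n/2 example above legal, the barrier has to be hoisted out of the branch so that every thread in the block reaches the same barrier; a sketch:

```cuda
// Sketch: the per-group work stays inside the branch, but the barrier
// is hoisted outside it, so no __syncthreads() sits inside divergent
// conditional code.
if (tid < n / 2) {
    // work for the first group
} else {
    // work for the second group
}
__syncthreads();   // reached by all threads in the block
```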

Thank you.

As I understand it, __syncthreads() translates to the “bar” PTX instruction. The PTX manual says that the instruction marks the arrival of a thread at the barrier and waits for all other threads to arrive.

This does NOT prevent you from dividing the block into 2 groups.

For example: The C code below works

if (threadIdx.x < 100)

{

// work done only by the first 100 threads

}

__syncthreads();

I have divided my block into two and it still keeps working. If the code above is buggy, can someone clarify?

If you ever want to DIVIDE a block into two groups and SYNCHRONIZE within each group, you can still do it (virtually) by splitting the IF-ELSE construct into multiple IF-ELSE constructs (all dividing the block in the same fashion) with __syncthreads() in between each of them.

Cumbersome, right. But it works! And, depending on what you are trying to do, such things can even kill performance.
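A sketch of that workaround (hypothetical phase bodies, same group division in every construct):

```cuda
// Instead of one if-else with a barrier inside each branch, split the
// work into phases. Every phase divides the block the same way, and
// the barriers sit between phases, where all threads can reach them.
if (tid < n / 2) { /* group A, phase 1 */ } else { /* group B, phase 1 */ }
__syncthreads();   // every thread in the block arrives here
if (tid < n / 2) { /* group A, phase 2 */ } else { /* group B, phase 2 */ }
__syncthreads();
```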

Your code is valid; you can see examples in the reduction & scan projects from the SDK, where the number of active threads is halved each time.
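The shape of that SDK-style reduction, as a sketch (hypothetical kernel, assuming a block size of 256): the set of active threads is halved each iteration, but __syncthreads() sits outside the conditional, so every thread reaches it.

```cuda
// Sketch of a tree reduction: only the first s threads do work in each
// iteration, but ALL threads in the block hit the barrier, so the
// conditional around __syncthreads() stays uniform.
__global__ void reduceKernel(const int *in, int *out)
{
    __shared__ int sdata[256];        // assumes blockDim.x == 256
    unsigned int tid = threadIdx.x;
    sdata[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)                  // active threads halved each pass
            sdata[tid] += sdata[tid + s];
        __syncthreads();              // reached by every thread
    }
    if (tid == 0)
        out[blockIdx.x] = sdata[0];   // one partial sum per block
}
```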

You can certainly divide the block into more groups, but IF there is a __syncthreads(), all threads must reach it. If not, you get lockups.
(Interestingly, the microcode instruction for bar has a bit field that could be there to describe which threads to wait for.)

If you need any help with trial and error, let me know :)

In total there appear to be 12 bits for this purpose; normally, as generated by ptxas, they are all set. To be exact, they are bit mask 0x001ffe00 of the first instruction word. I call them “BF_SYNC_ARG” in my assembler.

I have no idea what they’re exactly for, but as this is a barrier, it probably has something to do with which warps to synchronize.

You’re welcome to try some trial and error of course, it’d be very interesting. I have not yet thought up a good test case for it.

I am going to have to give your (dis)assembler a try then :)

making a simple program

running disassembler

changing little things

running assembler

should do the trick right?

I would think a small program like this:

__global__ void testkernel(int *g_data)

{

  __shared__ unsigned int dummytest[NUM_THREADS];

 unsigned int tid = threadIdx.x;

 dummytest[tid] = 0;

 __syncthreads();

 for (int k = 0; k < 100; k++)

    dummytest[tid] += tid;

 __syncthreads(); // we change this bar.sync bitmask afterwards

 if (tid==0)

    for (int l = 0; l < NUM_THREADS; l++)

      g_data[l] = dummytest[l];

}

That should show differences in g_data when the bitmask has an effect, right?

That looks good. Disassemble the resulting cubin file (with -o, so it can be re-assembled) and then use raw syntax for the modified bar.sync instruction. The bar.sync 0 instruction is

0000b8: 861ffe03 00000000 bar.sync.u32 0x00000000

Inserting this instruction using the raw syntax in my assembler would be done like this:

d.u32 0x861ffe03, 0x00000000

With the bitfield set to all zeroes, this would be:

d.u32 0x86000003, 0x00000000

… and so on

I’m not sure if your test case is entirely reliable, maybe you should include some conditionals on the thread id, but it’s a start :) Good luck!

Thanks for your efforts guys.

Wumpus,
You say that there’s this 12-bit field (1FFE) that could actually encode the WARPS that need synchronization. But there could be a total of 16 warps per block, right? 16*32 = 512. We can discount the warp that is executing the __syncthreads(). So, that leaves us with 15 warps. But that still does NOT explain the “12”. Can you share your thoughts here?

Thank you

My guess is as good as yours; just experiment if you want to know :) Or maybe an engineer from NVidia could comment, but I wouldn’t hold my breath on that.

It’s very possible that the feature isn’t exposed because it is limited in some way to the first few warps…

I tried, but the kernel fails if I change the bits. It looked like no threads wait when I replaced 1FFE with 0000, but that was not consistent: sometimes it hangs and fails. I also tried setting a few (or just one) of the bits in 1FFE to zero, but all such experiments resulted in a kernel hang.

I did NOT use “decuda” for this. I just changed the CUBIN file and re-ran the steps in compilation (the “fatbin” and CL invocations). I am NOT sure if my changes are breaking some checksum stored somewhere; I don’t know.

By the way, Denis’ program does NOT expose the race condition on my hardware. I had to add a global memory fetch in the iteration to expose the race. I use 8800 GTX hardware.

Here is my code:

#include <stdio.h>

#define NUM_THREADS 512

__global__ void testkernel(int *g_data)

{

 __shared__ unsigned int dummytest[NUM_THREADS];

unsigned int tid = threadIdx.x;

dummytest[tid] = 0;

__syncthreads();

if (threadIdx.x >= 32)

 {

	for (int k = 0; k < 100; k++)

	{

	dummytest[tid] += (tid + g_data[tid]);

	}

 }

__syncthreads(); // we change this bar.sync bitmask afterwards

if (tid==0)

   for (int l = 0; l < NUM_THREADS; l++)

     g_data[l] = dummytest[l];

}

int gData[NUM_THREADS];

void verify_gdata(void);

void display_gdata(void);

int main()

{

	dim3 grid, block;

	void *devptr;

	cudaMalloc(&devptr, sizeof(gData));

	cudaMemset(devptr, 0, sizeof(gData));

	block.x = NUM_THREADS;

	testkernel <<< grid, block >>> ((int *)devptr);

	cudaThreadSynchronize();

	cudaMemcpy(gData, devptr, sizeof(gData), cudaMemcpyDeviceToHost);

	display_gdata();

	printf("\nVerification Start\n");

	verify_gdata();

	printf("\nVerification done\n");

	return 0;

}

void verify_gdata(void)

{

	int i;

	for(i=32; i<NUM_THREADS; i++)

	{

  if (gData[i] != (i*100))

  {

  	printf("gData failure at %d\n", i);

  }

	}

	return;

}

void display_gdata(void)

{

	int i;

	for(i=0; i<NUM_THREADS; i++)

	{

  printf("%d ", gData[i]);

	}

}

Wumpus,

I am wondering if my “cubin” file change is breaking some “checksum” expectation somewhere in the CUDA driver.

Anyway, if that’s the case I would expect the failure to be immediate. In my case, the kernel hangs… so I would imagine that there is no checksum-related thing. But can you just confirm?

Thanks
Sarnath

No, there is no checksum; if your kernel hangs it’s because the change in synchronisation had some unexpected result.

Are there any Xid errors in your dmesg log? If so, you are creating invalid instructions; if not, there might be some other problem.

For example, calling this ‘partial’ syncthreads in a thread that is not being waited for might result in deadlocks.

I am using Windows XP. Where do I check for this? I may re-run this test sometime this week and can check it out if you know where I can see these errors. Thanks.