cuda threads synchronization

say I have a block with 384 threads, and I will use 256 of them to do one task and all the threads to finish up.

[indent]int i = threadIdx.x;
shared int a[5];
shared int buffer[256];
… …
if (i < 256)
{
[indent] (do something to buffer)
__syncthreads();
if (i < 5)
a[i] = buffer[i*50];[/indent]
}
__syncthreads();

… …[/indent]

after this code, it seems to me that threads 256 to 383 don’t see the changes made to a. They read a[0] to a[4] 0.The code is too long to post, it will be helpful if any one can give a wild guess what could cause this. Thanks.

say I have a block with 384 threads, and I will use 256 of them to do one task and all the threads to finish up.

[indent]int i = threadIdx.x;
shared int a[5];
shared int buffer[256];
… …
if (i < 256)
{
[indent] (do something to buffer)
__syncthreads();
if (i < 5)
a[i] = buffer[i*50];[/indent]
}
__syncthreads();

… …[/indent]

after this code, it seems to me that threads 256 to 383 don’t see the changes made to a. They read a[0] to a[4] 0.The code is too long to post, it will be helpful if any one can give a wild guess what could cause this. Thanks.

The __syncthreads needs to be executed by all threads. So do this instead:

int i = threadIdx.x;

	__shared__ int a[5];

	__shared__ int buffer[256];

	// ... ...

	if (i < 256)

	{

		// (do something to buffer)

	}

	__syncthreads();

	if (i < 256) {

		if (i < 5)

			a[i] = buffer[i*50];

		// ...

	}

	__syncthreads();

	// ... ...

The __syncthreads needs to be executed by all threads. So do this instead:

int i = threadIdx.x;

	__shared__ int a[5];

	__shared__ int buffer[256];

	// ... ...

	if (i < 256)

	{

		// (do something to buffer)

	}

	__syncthreads();

	if (i < 256) {

		if (i < 5)

			a[i] = buffer[i*50];

		// ...

	}

	__syncthreads();

	// ... ...