__threadfence() Can I use it for ...

Noel_Lopes · June 12, 2009, 6:50pm

I’m adding the elements of a big array into a single output. I’ll need to execute several blocks in order to compute the sum given the dimensions of the array.

Right now I’m executing the blocks in series, so I can guarantee that there are no simultaneous writes to the output.

Currently as I’m calling a single block at a time, my code looks something like this:

if (threadIdx.x == 0) *out += *shared_out;

My question is: Can I use __threadfence() to ensure there are no overlapping writes between two different blocks? Something like this:

if (threadIdx.x == 0) {
	*out += *shared_out;
	__threadfence();
}

Thanks

tmurray · June 12, 2009, 6:56pm

Oh __threadfence(), the most confusing instruction or intrinsic to probably ever appear anywhere.

It’s not a synchronization primitive, so no, it won’t do what you want here. However, if you did have another synchronization primitive and you didn’t want to use atomics, you would have to call threadfence before releasing the primitive to ensure that the write is visible to the now-active block.

Noel_Lopes · June 12, 2009, 7:40pm

Thanks. Is there any other solution for this?

tmurray · June 12, 2009, 8:14pm

A critical section would do it. Not hard to implement with atomics…
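
A minimal sketch of what such a critical section might look like, assuming a single global int lock that starts at 0 and that only thread 0 of each block ever tries to take it. The names sum_lock, add_with_lock and sum_with_lock are hypothetical, and block_values[blockIdx.x] stands in for the per-block result (*shared_out in the question). The __threadfence() before the release is the call described above: it makes the write to *out visible before another block can acquire the lock.

```cuda
#include <cuda_runtime.h>

// Hypothetical names; an illustration under the assumptions stated above.
__device__ int sum_lock = 0;   // 0 = free, 1 = held

__global__ void add_with_lock(const int *block_values, int *out)
{
    // Only one thread per block takes the lock, so threads of the same
    // warp never spin against each other.
    if (threadIdx.x == 0) {
        // Acquire: spin until we swap 0 -> 1.
        while (atomicCAS(&sum_lock, 0, 1) != 0) { }
        __threadfence();

        // Critical section: volatile forces a fresh read of *out from memory.
        volatile int *vout = out;
        *vout = *vout + block_values[blockIdx.x];

        // Make the write visible to other blocks, then release the lock.
        __threadfence();
        atomicExch(&sum_lock, 0);
    }
}

// Host helper: one value per block, returns their sum.
int sum_with_lock(const int *host_values, int num_blocks)
{
    int *d_values, *d_out, result = 0;
    cudaMalloc(&d_values, num_blocks * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_values, host_values, num_blocks * sizeof(int),
               cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(int));

    add_with_lock<<<num_blocks, 64>>>(d_values, d_out);

    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_values);
    cudaFree(d_out);
    return result;
}
```

Calling sum_with_lock with an array holding one value per block returns their total. The lock serializes the blocks, so this is mainly useful when the protected update cannot be expressed as a single atomic operation.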

cvnguyen · June 12, 2009, 10:42pm

Do not serialize your summation. The following example calculates the sum of a buffer in parallel:

#include <stdio.h>
#include <stdlib.h>

#define BLOCKSIZE 512

__device__ int sum;		// final result, read back by the host
__device__ long long timeval;	// clock cycles measured by thread 0

__global__ void cu_sum(void)
{
	// volatile keeps partial sums from being cached in registers
	__shared__ volatile int buffer[BLOCKSIZE];
	__shared__ int tmpsum;
	clock_t tmp = 0;

	// Test data: element i holds the value i
	buffer[threadIdx.x] = threadIdx.x;
	if (threadIdx.x == 0)
		tmpsum = 0;	// needed by the atomicAdd variant below
	__syncthreads();

	if (threadIdx.x == 0)
		tmp = clock();

#if (0)
	// Serialized variant: every thread adds its element atomically
	atomicAdd(&tmpsum, buffer[threadIdx.x]);
#else
	// Tree reduction: halve the number of active threads at each step
	int stride = BLOCKSIZE >> 1;
	while (stride != 0)
	{
		if (threadIdx.x < stride)
		{
			buffer[threadIdx.x] += buffer[threadIdx.x + stride];
		}
		stride >>= 1;
		__syncthreads();
	}
	if (threadIdx.x == 0)
		tmpsum = buffer[0];
#endif

	__syncthreads();
	if (threadIdx.x == 0)
	{
		timeval = clock() - tmp;
		sum = tmpsum;
	}
}

int main(void)
{
	dim3 grid, block;
	int hostsum;
	long long hosttime;

	grid.x = 1;
	block.x = BLOCKSIZE;
	cu_sum<<<grid, block>>>();

	// Pass the symbols themselves; newer CUDA releases no longer accept string names
	cudaMemcpyFromSymbol(&hostsum, sum, sizeof(int), 0, cudaMemcpyDeviceToHost);
	cudaMemcpyFromSymbol(&hosttime, timeval, sizeof(long long), 0, cudaMemcpyDeviceToHost);
	printf("Sum = %d	  Total clocks = %lld\n", hostsum, hosttime);
	return 0;
}

Noel_Lopes · June 12, 2009, 11:12pm

Maybe I’m wrong but your function doesn’t seem to work for a grid.x greater than 1, which is what I need.

Noel_Lopes · June 12, 2009, 11:18pm

I have never worked with atomics. Can you please post a simple example for this situation?

Also, I have read somewhere that atomic operations are slow. I have to sum around 20000 elements (this gives about 40 blocks of 512 threads). Would it be faster to write the 40 block results into an array and then sum them up?
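
For the atomic variant asked about here, a minimal sketch could look like the following, assuming int elements and a launch with exactly BLOCKSIZE threads per block (a power of two). The names sum_atomic and sum_array_atomic are hypothetical. Each block first reduces its own chunk in shared memory, so only one atomicAdd per block reaches global memory (about 40 in total for 20000 elements).

```cuda
#include <cuda_runtime.h>

#define BLOCKSIZE 512

// Hypothetical kernel: per-block tree reduction, then one atomicAdd per block.
// Assumes blockDim.x == BLOCKSIZE and that *out was zeroed before the launch.
__global__ void sum_atomic(const int *in, int n, int *out)
{
    __shared__ int buffer[BLOCKSIZE];

    // Threads past the end of the array contribute 0.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buffer[threadIdx.x] = (i < n) ? in[i] : 0;
    __syncthreads();

    for (int stride = blockDim.x >> 1; stride != 0; stride >>= 1) {
        if (threadIdx.x < stride)
            buffer[threadIdx.x] += buffer[threadIdx.x + stride];
        __syncthreads();
    }

    // One atomic per block: concurrent blocks cannot lose each other's update.
    if (threadIdx.x == 0)
        atomicAdd(out, buffer[0]);
}

// Host helper: copies the array to the device and returns the sum.
int sum_array_atomic(const int *h_in, int n)
{
    int blocks = (n + BLOCKSIZE - 1) / BLOCKSIZE;
    int *d_in, *d_out, result = 0;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(int));

    sum_atomic<<<blocks, BLOCKSIZE>>>(d_in, n, d_out);

    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

With float data the same pattern needs a device that supports atomicAdd on floats, and the order of the additions (and therefore the rounding) can differ from run to run.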

tmurray · June 12, 2009, 11:20pm

Yes, a reduction is generally faster.

Noel_Lopes · June 12, 2009, 11:29pm

thanks

cvnguyen · June 12, 2009, 11:54pm

That is just the code for one block. You would need extra code for organizing the grid.
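
One way the grid could be organized, sketched under the assumption of int elements and at most BLOCKSIZE * BLOCKSIZE of them: every block writes its partial sum into an array, and a second single-block launch of the same kernel adds the partial sums. The names block_sums and sum_array_two_pass are hypothetical. No atomics or fences are needed, because kernel launches issued to the same stream run one after the other.

```cuda
#include <cuda_runtime.h>

#define BLOCKSIZE 512

// Hypothetical kernel: each block reduces its chunk of `in` and stores the
// result in partial[blockIdx.x]. Assumes blockDim.x == BLOCKSIZE.
__global__ void block_sums(const int *in, int n, int *partial)
{
    __shared__ int buffer[BLOCKSIZE];

    // Threads past the end of the array contribute 0.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buffer[threadIdx.x] = (i < n) ? in[i] : 0;
    __syncthreads();

    for (int stride = blockDim.x >> 1; stride != 0; stride >>= 1) {
        if (threadIdx.x < stride)
            buffer[threadIdx.x] += buffer[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        partial[blockIdx.x] = buffer[0];
}

// Host helper. Assumes 0 < n <= BLOCKSIZE * BLOCKSIZE so that the second
// pass fits into a single block.
int sum_array_two_pass(const int *h_in, int n)
{
    int blocks = (n + BLOCKSIZE - 1) / BLOCKSIZE;
    int *d_in, *d_partial, *d_out, result = 0;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_partial, blocks * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    // Pass 1: one partial sum per block (40 blocks for 20000 elements).
    block_sums<<<blocks, BLOCKSIZE>>>(d_in, n, d_partial);
    // Pass 2: a single block adds the partial sums.
    block_sums<<<1, BLOCKSIZE>>>(d_partial, blocks, d_out);

    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);
    cudaFree(d_out);
    return result;
}
```

For larger inputs the second pass could be repeated until a single value remains, or each thread could add several elements before the shared-memory reduction starts.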
