__threadfence() Can I use it for ...

I’m adding the elements of a big array into a single output. I’ll need to execute several blocks in order to compute the sum given the dimensions of the array.

Right now I’m executing the blocks in series, so I can guarantee that there are no simultaneous writes to the output.

Since I’m currently launching a single block at a time, my code looks something like this:

if (threadIdx.x == 0) *out += *shared_out;

My question is: Can I use __threadfence() to ensure there are no overlapping writes between two different blocks? Something like this:

if (threadIdx.x == 0) {
	*out += *shared_out;
	__threadfence();
}

Oh __threadfence(), the most confusing instruction or intrinsic to probably ever appear anywhere.

It’s not a synchronization primitive, so no, it won’t do what you want here. However, if you did have another synchronization primitive and you didn’t want to use atomics for the update itself, you would have to call __threadfence() before releasing the primitive to ensure that the write is visible to the now-active block.

thanks. Is there any other solution for this?

A critical section would do it. Not hard to implement with atomics…
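For what it's worth, here is a minimal sketch of such a critical section, assuming only thread 0 of each block ever takes the lock. The names `lock_word`, `add_with_lock` and `run_locked_sum` are made up for illustration. The lock is a spin on `atomicCAS`, and `__threadfence()` is called before the release so the updated total is visible to the next block that gets in.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical lock word: 0 = free, 1 = held.
__device__ int lock_word = 0;

// Each block adds `per_block` to *out inside a critical section.
// `out` is volatile so every access really goes to memory.
__global__ void add_with_lock(volatile int *out, int per_block)
{
    // Assumption: only one thread per block competes for the lock, so
    // threads of the same warp never spin against each other.
    if (threadIdx.x == 0) {
        while (atomicCAS(&lock_word, 0, 1) != 0) { }  // acquire: swap 0 -> 1
        __threadfence();            // keep the read below after the acquire
        *out = *out + per_block;    // protected read-modify-write
        __threadfence();            // publish the write before releasing
        atomicExch(&lock_word, 0);  // release
    }
}

// Hypothetical host helper: launches `blocks` blocks, returns the total.
int run_locked_sum(int blocks, int per_block)
{
    int *d_out = nullptr;
    int h_out = 0;
    cudaMalloc(&d_out, sizeof(int));
    cudaMemset(d_out, 0, sizeof(int));
    add_with_lock<<<blocks, 64>>>(d_out, per_block);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}

int main()
{
    printf("total = %d\n", run_locked_sum(40, 3));
    return 0;
}
```

This serializes the blocks at the update, so it only makes sense when each block does very little inside the lock. For a plain integer sum, a single `atomicAdd` per block would do the same job without a lock at all.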

Do not serialize your summation. The following example calculates the sum of a buffer in parallel:

#include <stdio.h>
#include <stdlib.h>

#define BLOCKSIZE 512

__device__ int sum;
__device__ long long timeval;

__global__ void cu_sum(void)
{
	__shared__ volatile int buffer[BLOCKSIZE];
	__shared__ int tmpsum;
	clock_t tmp = 0;

	/* Fill the buffer with test data: 0, 1, ..., BLOCKSIZE - 1 */
	buffer[threadIdx.x] = threadIdx.x;
	__syncthreads();

	if (threadIdx.x == 0)
		tmp = clock();

#if (0)
	/* Serialized alternative (disabled); tmpsum would have to be zeroed first */
	atomicAdd(&tmpsum, buffer[threadIdx.x]);
#else
	/* Tree reduction in shared memory: halve the active range each step */
	int stride = BLOCKSIZE >> 1;
	while (stride != 0)
	{
		if (threadIdx.x < stride)
		{
			buffer[threadIdx.x] += buffer[threadIdx.x + stride];
		}
		__syncthreads();	/* all threads must reach this barrier */
		stride >>= 1;
	}

	if (threadIdx.x == 0)
		tmpsum = buffer[0];
#endif
	__syncthreads();

	if (threadIdx.x == 0)
	{
		timeval = clock() - tmp;
		sum = tmpsum;
	}
}

int main(void)
{
	dim3 grid, block;
	int hostsum;
	long long hosttime;

	grid.x = 1;
	block.x = BLOCKSIZE;

	cu_sum<<<grid, block>>>();

	/* Pass the symbols themselves; newer toolkits no longer accept string names */
	cudaMemcpyFromSymbol(&hostsum, sum, sizeof(int), 0, cudaMemcpyDeviceToHost);
	cudaMemcpyFromSymbol(&hosttime, timeval, sizeof(long long), 0, cudaMemcpyDeviceToHost);

	printf("Sum = %d\t  Total clocks = %lld\n", hostsum, hosttime);
	return 0;
}

Maybe I’m wrong but your function doesn’t seem to work for a grid.x greater than 1, which is what I need.

I have never worked with atomics. Can you please post a simple example for this situation?

Also, I have read somewhere that atomic operations are slow. I have to sum around 20000 elements (this gives about 40 blocks of 512 threads). Would it be faster to have each block write its result into an array of 40 partial sums and then add those up?

Yes, a reduction is generally faster.
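A rough sketch of that two-pass idea, assuming integer input and at most BLOCKSIZE partial sums (true for 40 blocks). The kernel name `block_sums` and the host helper `two_pass_sum` are made up for illustration. The same kernel is launched twice: once to produce one partial sum per block, and once with a single block to add the partial sums.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCKSIZE 512

// Each block reduces its slice of `in` and writes one value to `partial`.
// Must be launched with exactly BLOCKSIZE threads per block.
__global__ void block_sums(const int *in, int n, int *partial)
{
    __shared__ int buffer[BLOCKSIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buffer[threadIdx.x] = (i < n) ? in[i] : 0;   // pad the last block with zeros
    __syncthreads();

    for (int stride = BLOCKSIZE >> 1; stride != 0; stride >>= 1) {
        if (threadIdx.x < stride)
            buffer[threadIdx.x] += buffer[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        partial[blockIdx.x] = buffer[0];
}

// Hypothetical host helper. Assumes 0 < n <= BLOCKSIZE * BLOCKSIZE so that
// the second pass fits in a single block.
int two_pass_sum(const std::vector<int> &h_in)
{
    int n = (int)h_in.size();
    int blocks = (n + BLOCKSIZE - 1) / BLOCKSIZE;   // 20000 elements -> 40 blocks
    int *d_in = nullptr, *d_partial = nullptr, *d_out = nullptr;
    int h_out = 0;

    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_partial, blocks * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    block_sums<<<blocks, BLOCKSIZE>>>(d_in, n, d_partial);   // pass 1
    block_sums<<<1, BLOCKSIZE>>>(d_partial, blocks, d_out);  // pass 2

    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);
    cudaFree(d_out);
    return h_out;
}

int main()
{
    std::vector<int> data(20000, 1);
    printf("sum = %d\n", two_pass_sum(data));
    return 0;
}
```

Kernel launches on the same stream run in order, so the second pass sees all partial sums without any fence or atomic.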


That is just the code for one block. You would need extra code for organizing the grid.
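As a sketch of what that extra code might look like, assuming integer data: each thread first accumulates a strided share of the input, the block then reduces in shared memory as in the single-block version, and thread 0 adds the block result to the output with one `atomicAdd`. The names `grid_sum` and `sum_on_device` are hypothetical.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCKSIZE 512

// Must be launched with BLOCKSIZE threads per block; *out must be zeroed first.
__global__ void grid_sum(const int *in, int n, int *out)
{
    __shared__ int buffer[BLOCKSIZE];

    // Grid-stride loop: works for any n and any number of blocks.
    int local = 0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        local += in[i];
    buffer[threadIdx.x] = local;
    __syncthreads();

    // Same shared-memory tree reduction as the single-block example.
    for (int stride = BLOCKSIZE >> 1; stride != 0; stride >>= 1) {
        if (threadIdx.x < stride)
            buffer[threadIdx.x] += buffer[threadIdx.x + stride];
        __syncthreads();
    }

    // One atomic per block, so only gridDim.x atomics in total.
    if (threadIdx.x == 0)
        atomicAdd(out, buffer[0]);
}

// Hypothetical host helper: sums h_in on the device using `blocks` blocks.
int sum_on_device(const std::vector<int> &h_in, int blocks)
{
    int n = (int)h_in.size();
    int *d_in = nullptr, *d_out = nullptr;
    int h_out = 0;

    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(int));

    grid_sum<<<blocks, BLOCKSIZE>>>(d_in, n, d_out);

    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}

int main()
{
    std::vector<int> data(20000, 1);
    printf("sum = %d\n", sum_on_device(data, 40));
    return 0;
}
```

With about 40 blocks the atomics are unlikely to matter much, since there is one per block rather than one per element.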