Parallel Programming Problem: Sum values into one array-cell

Hello everyone,

I’m trying to solve the following problem:

The following explanation is a bit longish. The short version is:

I’m getting write-after-write errors due to parallelization. You can read the long version or look directly at the code:

We have an array of arrays (a matrix), and I want to calculate the Euclidean distances between its rows:

Each row of the matrix corresponds to a 64-dimensional vector.

If I want to calculate the squared distance of the first-row vector to the second-row vector, for example, I have to sum up all 64 squared differences. This sum should be accumulated into a certain cell of another array.

There’s the problem! Because of the parallelism there are write-after-write hazards, so the target cell ends up holding not the sum but only 0 (the starting value) plus any ONE squared difference (usually the 64th one).

Here is my kernel:

__global__ void square_array(int *a, int *b, int candidate, int N, int linescount)
{
	//we assume that candidate==0
	int idx = blockIdx.x * blockDim.x + threadIdx.x;

	//threads 64..127 handle the components of the second row vector
	if ((64 <= idx) && (idx < 128)) {

		int diff = a[idx] - a[idx % 64];

		//Here comes the problematic line: each thread's read-modify-write
		//of b[1] can be overwritten by another thread, so the final value
		//is 0 plus just ONE squared difference instead of the whole sum.
		b[(idx - idx % 64) / 64] = b[(idx - idx % 64) / 64] + diff * diff;
	}
}

How could I solve this problem? I have already experimented with __syncthreads(), but without success.

You need to use atomicAdd().
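For example, the accumulation in the question’s kernel could be rewritten like this (a sketch: I’ve kept the question’s signature and indexing, and assumed the intended thread range is 64..127):

```cuda
__global__ void square_array(int *a, int *b, int candidate, int N, int linescount)
{
    // we assume that candidate == 0
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if ((64 <= idx) && (idx < 128)) {
        int diff = a[idx] - a[idx % 64];

        // atomicAdd serializes the read-modify-write on b[1], so no
        // thread's contribution can be lost to a write-after-write hazard
        atomicAdd(&b[(idx - idx % 64) / 64], diff * diff);
    }
}
```

Note that atomicAdd() on int in global memory requires compute capability 1.1 or higher.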

So, to abstract away from computing distances, matrices, etc., if I read it correctly, the thing you want to compute at some point is

sum = v[0] + v[1] + … + v[63]

where the value v[t] is stored in thread t. Am I right?

In order to do that, you need a reduction algorithm: you store the data in shared memory, and then at each step each thread computes the sum of two values and outputs a single value. In logarithmic time you get the total sum. Consider the following code:

int i = threadIdx.x;

if (i < 32) {
	//tree reduction within a single warp; v[] lives in shared memory
	v[i] += v[i + 32];
	v[i] += v[i + 16];
	v[i] += v[i + 8];
	v[i] += v[i + 4];
	v[i] += v[i + 2];
	v[i] += v[i + 1];
}
As a result v[0] will hold the sum of all elements in the array. v[1…31] will hold some garbage, v[32…63] won’t be changed.

Note that I employ only one warp to perform the summation above, so I don’t need to call __syncthreads() between the additions.

Also note, if you sum it this way, you get no bank conflicts at all :)
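Putting these pieces together, a complete kernel along these lines might look like the sketch below. The 64-thread block, the row layout, and the kernel name squared_distance are my assumptions, not code from the thread; also, on modern GPUs (Volta and later) the implicit warp synchrony this relies on is no longer guaranteed, so there you would use __syncwarp() or warp shuffle intrinsics instead.

```cuda
__global__ void squared_distance(const int *a, int *b)
{
    // volatile keeps the compiler from caching v[] in registers,
    // which the warp-synchronous reduction below depends on
    __shared__ volatile int v[64];
    int i = threadIdx.x;

    // each of the 64 threads squares one component difference (row 1 - row 0)
    int diff = a[64 + i] - a[i];
    v[i] = diff * diff;
    __syncthreads();            // make all 64 partial values visible

    // tree reduction within a single warp: no __syncthreads() needed
    // between steps, and no bank conflicts
    if (i < 32) {
        v[i] += v[i + 32];
        v[i] += v[i + 16];
        v[i] += v[i + 8];
        v[i] += v[i + 4];
        v[i] += v[i + 2];
        v[i] += v[i + 1];
    }

    if (i == 0)
        b[0] = v[0];            // v[0] now holds the full squared distance
}
```

You would launch it with one 64-thread block, e.g. squared_distance<<<1, 64>>>(d_a, d_b);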