How to use multiple threads to accumulate one variable

__global__ static void testCUDA(int* device_block)
{
    const int bid = blockIdx.x;
    device_block[bid] += 1;
}

int main()
{
    testCUDA<<<10, 5>>>(device_block);
}

The above is a part of my program.
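For completeness, the host side I have in mind is roughly this (a simplified sketch; host_block is the host copy that the results are read back into before printing):

int host_block[10];                                   // host copy of the per-block counters
int* device_block;
cudaMalloc((void**)&device_block, 10 * sizeof(int));  // one counter per block
cudaMemset(device_block, 0, 10 * sizeof(int));        // start all counters at 0
testCUDA<<<10, 5>>>(device_block);
cudaMemcpy(host_block, device_block, 10 * sizeof(int), cudaMemcpyDeviceToHost);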

When I print the copied-back contents of device_block as follows:

for (int i = 0; i <= 9; i++)
{
    // host_block holds the copied-back contents of device_block
    printf("device_block[%d] = %d\n", i, host_block[i]);
}

I expect my result to be:

device_block[0] = 5
device_block[1] = 5
device_block[2] = 5
device_block[3] = 5
device_block[4] = 5
device_block[5] = 5
device_block[6] = 5
device_block[7] = 5
device_block[8] = 5
device_block[9] = 5

But in fact, my result is:

device_block[0] = 1
device_block[1] = 1
device_block[2] = 1
device_block[3] = 1
device_block[4] = 1
device_block[5] = 1
device_block[6] = 1
device_block[7] = 1
device_block[8] = 1
device_block[9] = 1

One block has 5 threads, so the kernel function is executed five times in each block. Based on that, I expected device_block[block ID] to be accumulated to five, but in fact device_block[block ID] is always 1. I don't know what mistake I'm making.

The way you wrote it, you'll have race conditions. A simple explanation is this:

When you do "device_block[bid] += 1", it's actually not a single instruction on the GPU (or on most CPUs); it most likely consists of three instructions:

load r, device_block[bid]
add r, 1
store device_block[bid], r

So it's possible that all five threads run the first instruction at the same time, which means they all load the same value into their r (0, in this example). Then they each add 1 to their r, making r equal to 1. Finally, they all store their r into the same location, so device_block[bid] ends up as 1, not 5.

To solve this problem you'll need atomic operations, which ensure that "device_block[bid] += 1" is done in an atomic way. Atomic operations on global memory are supported since CUDA 1.1.
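For example, here is a minimal sketch of your kernel rewritten with atomicAdd (assuming a device with compute capability 1.1 or later, since global-memory atomics require it):

__global__ static void testCUDA(int* device_block)
{
    const int bid = blockIdx.x;
    // atomicAdd performs the load-add-store as one indivisible operation,
    // so the five threads of a block cannot overwrite each other's updates
    atomicAdd(&device_block[bid], 1);
}

With this change, each of the 5 threads per block increments its block's counter exactly once, and the printed values should be 5.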

Atomic operations will work, but they effectively serialize the accesses to memory, i.e. they make the threads execute one after another. That's okay if such writes are rare and sparse. If your algorithm needs to sum thousands of numbers, look into the parallel reduction algorithm.
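As an illustration, a common shared-memory reduction kernel looks roughly like this (a hypothetical sumReduce, not from this thread; it assumes blockDim.x is a power of two and the input length is a multiple of blockDim.x):

__global__ void sumReduce(const int* input, int* block_sums)
{
    extern __shared__ int sdata[];
    const unsigned int tid = threadIdx.x;

    // each thread loads one element into shared memory
    sdata[tid] = input[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // tree-based reduction: halve the number of active threads each step
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // thread 0 writes this block's partial sum
    if (tid == 0)
        block_sums[blockIdx.x] = sdata[0];
}

It would be launched with the dynamic shared-memory size set to the block size, e.g. sumReduce<<<numBlocks, threadsPerBlock, threadsPerBlock * sizeof(int)>>>(input, block_sums); the partial sums in block_sums can then be reduced the same way, or summed on the host.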

Thank you for your reply.