[Newbie] Operations of type "+=" on shared memory not working as expected

I am in the process of writing my first real CUDA program, but something is not quite right in the kernel. I have boiled down the problem to be analogous to the following statements…

__shared__ float Pot_sh[4];

Pot_sh[threadIdx.x]=0.0;

__syncthreads();

Pot_sh[threadIdx.x]+=threadIdx.y;

This is not really the application I’m trying to develop; it’s just that these few lines of code convey the intended idea.

The block is of dimensions (4,128).

Now when I run Pot_sh[threadIdx.x] += threadIdx.y, I expect it to accumulate the sum of all the threadIdx.y values. But when I check the value (copying it out to device memory and then memcpy-ing it back to the host), I find it to be 127, which is just the last threadIdx.y.

From what I have read so far I can’t see why these instructions shouldn’t work, so I am obviously overlooking something. Are there any special rules for read-modify-write operations on shared memory?

P.S.: I just realized that maybe every thread reads the value 0, increments it by its own threadIdx.y, and writes it back, and only the last write survives, so the value is 127. Is this inference correct? If so, what is the usual way to perform such reductions?

Your conclusion is correct. Here’s what happens:

You initialize the value to zero.
Every thread reads the value into a register (let’s call it Reg).
In each thread, Reg == 0.
Every thread then adds its threadIdx.y to Reg.
In each thread, Reg == threadIdx.y.
Every thread writes Reg back to the same location.

You need to remember that the threads execute in parallel. This is a synchronization error: exactly one of the conflicting writes is guaranteed to succeed, but which one is undefined, and in your case it happened to be the last one. You would need some sort of synchronization mechanism, or to serialize the code. I believe atomic instructions may help you here, but I’m not the best person to ask about that. Keep in mind that atomics give you the equivalent of serial code masked as parallel, so the accumulation itself won’t see a benefit from parallelization.
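A rough sketch of the atomic version, assuming your GPU is recent enough to support atomicAdd on floats in shared memory (otherwise you would have to switch to integer accumulation):

__shared__ float Pot_sh[4];

// Let one thread per x slot do the initialization.
if (threadIdx.y == 0)
    Pot_sh[threadIdx.x] = 0.0f;
__syncthreads();

// atomicAdd serializes the conflicting read-modify-write updates,
// so no thread's contribution gets lost.
atomicAdd(&Pot_sh[threadIdx.x], (float)threadIdx.y);
__syncthreads();

// Pot_sh[threadIdx.x] now holds 0 + 1 + ... + (blockDim.y - 1).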

One way to solve the problem would be to have ty[0] add the values from ty[0] and ty[1] and save the result, ty[2] add the values from ty[2] and ty[3] and save it, and so on…
Then ty[0] adds the partial results from ty[0] and ty[2], ty[4] adds those from ty[4] and ty[6], …
and so on…
until the last step, when ty[0] adds the partial results from ty[0] and ty[64].

(ty[n] is a thread with y index n)

That’s the best you can do to parallelize this type of summation. It’s not the prettiest solution, and I think there’s a better way to organize the threads in order to limit the impact of divergence; either way, you will need at least log2(n) iterations.
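An untested sketch of that idea, written with the stride-halving ordering the SDK reduction sample uses (which is the thread organization that limits divergence); it assumes blockDim.y is a power of two and uses a hypothetical 2D array Pot_sh2 so each thread starts with its own slot:

__shared__ float Pot_sh2[4][128];   // one slot per (x, y) thread

// Each thread deposits its own contribution first.
Pot_sh2[threadIdx.x][threadIdx.y] = (float)threadIdx.y;
__syncthreads();

// Pairwise tree reduction along y: 128 -> 64 -> 32 -> ... -> 1.
// The __syncthreads() stays outside the if so every thread reaches it.
for (int stride = blockDim.y / 2; stride > 0; stride /= 2)
{
    if (threadIdx.y < stride)
        Pot_sh2[threadIdx.x][threadIdx.y] +=
            Pot_sh2[threadIdx.x][threadIdx.y + stride];
    __syncthreads();
}

// Pot_sh2[threadIdx.x][0] now holds the sum over all threadIdx.y.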

Alex

I was just looking at the reduction example in the SDK and was afraid I would end up having to do something like that. I think I’ll start off with an O(n) array summation for now and see how the performance is. Thanks for the input.
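In case it helps anyone else, a minimal sketch of that serial variant (again using a hypothetical 2D scratch array; only the threadIdx.y == 0 threads loop, so it takes n additions per x index instead of log2(n) steps):

__shared__ float Pot_sh[4];
__shared__ float scratch[4][128];

scratch[threadIdx.x][threadIdx.y] = (float)threadIdx.y;
__syncthreads();

// One thread per x index walks its whole column serially.
if (threadIdx.y == 0)
{
    float sum = 0.0f;
    for (int y = 0; y < blockDim.y; ++y)
        sum += scratch[threadIdx.x][y];
    Pot_sh[threadIdx.x] = sum;
}
__syncthreads();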