Shared memory atomics and other questions.

I have threads organized as follows:
blockDim.x = 32
blockDim.y = some number

This is done so that every thread with threadIdx.x == 1 belongs to a different warp.

Now, assuming foo is an integer in shared memory, will the following code snippet lead to atomic writes:

if (threadIdx.x == 1) foo++;

The reasoning is that since only one thread per warp executes the instruction, and warps execute serially on an SM, the increments of foo should be serialized as well, without the need for explicit shared memory atomics… Is this correct?
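
For concreteness, here is a minimal sketch of what I have in mind (the kernel name and the surrounding code are just illustrative):

__global__ void perWarpIncrement(int *result)
{
    __shared__ int foo;
    if (threadIdx.x == 0 && threadIdx.y == 0)
        foo = 0;
    __syncthreads();

    // blockDim.x == 32, so each value of threadIdx.y is a separate warp
    // and exactly one thread per warp takes this branch.
    if (threadIdx.x == 1)
        foo++;                          // is this safe without atomics?

    __syncthreads();
    if (threadIdx.x == 0 && threadIdx.y == 0)
        *result = foo;                  // the hope is that this equals blockDim.y
}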

Regards,
Debdatta Basu.

No. foo++; translates to multiple machine instructions.
Even if it did not, a single instruction takes about 22 cycles to execute, which means that you have somewhere between 6 (on compute 1.x, where a new instruction is started every 4 cycles) and 44 (on Fermi, where two new instructions start executing every cycle) instructions in flight at any time. On top of that, there is some superscalar execution of moves and muls (and of general arithmetic on compute capability 2.1).
So even if the programming model appears simple, there are a lot of things going on in parallel, and only atomic operations can guarantee proper ordering of memory accesses.
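
For example, a safe variant of your snippet would use atomicAdd instead (a sketch; shared-memory atomics require compute capability 1.2 or later):

__global__ void perWarpIncrementAtomic(int *result)
{
    __shared__ int foo;
    if (threadIdx.x == 0 && threadIdx.y == 0)
        foo = 0;
    __syncthreads();

    // foo++ would compile into a separate load, add and store, so the
    // read-modify-write sequences of different warps can interleave.
    // atomicAdd performs the whole sequence as one indivisible operation.
    if (threadIdx.x == 1)
        atomicAdd(&foo, 1);

    __syncthreads();
    if (threadIdx.x == 0 && threadIdx.y == 0)
        *result = foo;                  // now reliably equals blockDim.y
}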

@tera, thanks for the reply… I have a problem though…

I don't understand how pipelining instructions from multiple warps affects atomicity…
Even if two new instructions are issued to the cores every clock cycle, they should still execute in a serialized fashion… The same goes for ILP, which, AFAIK, exists only on GF104…

I might be wrong, as I'm new to this, and if so, it would be insightful to know why.

Regards,
Debdatta Basu.

What do you mean by “serialized fashion”? The important question for atomic access is whether a second thread can read the unmodified value before the first thread writes back the result of the operation. As fetch and writeback are several stages apart in the pipeline, this is certainly the case.

ILP has existed since compute 1.0, which was able to do a mul in its special function unit in parallel to an fmad (the infamous “missing mul”, because it often could not be scheduled due to register file bandwidth starvation), or a load/store in parallel. GF104 was, however, the first GPU able to execute two fmads from the same thread in parallel (although again it seems to suffer from bandwidth starvation).
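
If you want to see the effect, a small test along these lines (all names are made up, just a sketch) compares the plain increment with the atomic one; with enough resident warps the plain counter may come out short:

#include <cstdio>
#include <cuda_runtime.h>

// One increment per warp, counted twice: once with a plain ++,
// once with atomicAdd, so lost updates become visible.
__global__ void racyVsAtomic(int *out)
{
    __shared__ int racy, safe;
    if (threadIdx.x == 0 && threadIdx.y == 0) {
        racy = 0;
        safe = 0;
    }
    __syncthreads();

    if (threadIdx.x == 1) {
        racy++;               // separate load, add and store; warps can interleave
        atomicAdd(&safe, 1);  // indivisible read-modify-write
    }

    __syncthreads();
    if (threadIdx.x == 0 && threadIdx.y == 0) {
        out[0] = racy;        // may come out smaller than blockDim.y
        out[1] = safe;        // always equals blockDim.y
    }
}

int main()
{
    int *d_out = 0, h_out[2];
    cudaMalloc((void**)&d_out, 2 * sizeof(int));
    racyVsAtomic<<<1, dim3(32, 16)>>>(d_out);
    cudaMemcpy(h_out, d_out, 2 * sizeof(int), cudaMemcpyDeviceToHost);
    printf("plain: %d  atomic: %d  (warps per block: 16)\n", h_out[0], h_out[1]);
    cudaFree(d_out);
    return 0;
}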

Got it… Thanks. I somehow assumed that later pipeline stages would have access to the value modified in earlier stages, which is wrong… Atomic operations will be needed…

However, in this case can we assume that at most 2 threads have access to the data location foo?

Regards,

Debdatta Basu.

Just when I thought the issue was resolved… Sigh…

I was looking through the GPU Computing SDK, and I came across this in the whitepaper for the oclHistogram sample:

“… Only G10x NVIDIA GPUs provide built-in support for workgroup-wide atomic operations in local memory. But even on earlier G8x / G9x NVIDIA GPUs local-memory atomic operations can be emulated basing on the fundamental fact that work-groups are executed as subgroups of logically coherent work-items, called warps, though the “consistency domain” of such manually-implemented atomic operations will also be limited by warp size, which is 32 work-items on G8x / G9x / G10x NVIDIA GPUs…”

Isn't this the same idea? It must be that this works on G8x / G9x / G10x because only one instruction is issued per multiprocessor each clock cycle, and that it will break down on Fermi due to the presence of 2 warp schedulers… Am I right?

Regards,
Debdatta Basu

Idea: the docs say, “when a scheduler issues a double precision instruction, the other scheduler cannot issue an instruction”.

This means that the shared atomic operations trick should work on Fermi too, provided foo is double precision… right? Though I doubt this would have any benefits over real atomic ops…

Btw, sorry for double posting. It was an accident…

Regards,
Debdatta.

bump…:)

If you look at the actual code of the example, it is based on collision detection, so it uses a different idea than what you described. It takes advantage of the fact that in case of a collision, within each warp exactly one of the writes is going to succeed. By storing a thread id together with the data, the collision detection can then figure out which thread was successful and retry the others.

This will work on Fermi as well, because the code is designed so that different warps work on separate histograms and never collide.
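
For reference, the core of that trick looks roughly like this (a sketch along the lines of the histogram whitepaper; the function name and tag layout are illustrative):

// Emulated shared-memory increment for pre-atomics GPUs (G8x/G9x).
// Only threads of the same warp may share s_Hist. The upper 5 bits of
// each counter hold a per-lane tag, the lower 27 bits hold the count.
__device__ void warpIncrement(volatile unsigned int *s_Hist,
                              unsigned int bin,
                              unsigned int threadTag)   // e.g. (threadIdx.x & 31) << 27
{
    unsigned int count;
    do {
        count = s_Hist[bin] & 0x07FFFFFFU;   // strip the previous tag
        count = threadTag | (count + 1);     // increment and tag with our lane id
        s_Hist[bin] = count;                 // colliding lanes all write; exactly one wins
    } while (s_Hist[bin] != count);          // losers see a foreign tag and retry
}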

@Tera

Thanks for the reply… You're right… :) I realized that a few hours after bumping the thread…

Thanks a ton… It has been very helpful…

-Debdatta Basu.
