Writes to the same memory location: can't I add numbers from different threads?

I have a little concern. I would like to be able to add numbers computed in different threads into the same location in global memory. I'm working on a finite element solver, doing computations on every element. For each element I compute its force contribution to each of its nodes, and I would like to add each result into one array in memory, like:

force[NodeNumber1] += computed_value1;

force[NodeNumber2] += computed_value2;

force[NodeNumber3] += computed_value3;

force[NodeNumber4] += computed_value4;

Since I have elements that share nodes, different threads might try to add their value to the same location in memory at the same time. In the programming guide we can read:

So it means my result is not guaranteed to be correct within a warp, right? Is it guaranteed between warps? Is it possible to do what I want and be sure that every value will have been added?

Yes, results are not correct within a warp, or from different warps either, as += is not an atomic op. You can do what you want to do. Within a warp, see the histogram example (type it into the search box), which is good for summing 27-bit ints into shared memory. For floats in global memory I propose a tagged-accumulator solution that should work (IMHO) in all circumstances, provided an align(8) structure write occurs as a single atomic 64-bit write (which seems the only way to implement it, but the guide is too thin to tell us):

struct __align__8 {int tag; float acc;} taggedAcc[NNN];

Read your target accumulator struct; sync; add your result to the float member, put your tid into the tag int and write the struct back; sync; read it back and check for your tag, looping back to the sync before the add until your write has succeeded. It is going to be pretty damn slow. You can't do this in shared memory, as there are no 64-bit store ops to shared memory (I believe).
Eric
Perhaps Nvidia could offer a confirmation…

ed: except that you can't put a sync in divergent code, so the above sequence has to be repeated a fixed number of times, defined by the maximum number of collisions you can have in your particular problem.

Thank you Eric. It sounds tricky, but it seems to be the only solution since the hardware doesn't allow atomic operations. I'm worried about the speed I will get, unless the extra time is hidden by the global memory latency. I will try this implementation anyway; let's hope it will be fast enough for my application. Otherwise I will have to re-implement my algorithm in a gather fashion with two passes (writing to different arrays and then adding them in another kernel).
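In case it helps pin down that two-pass gather fallback: here is a minimal sketch (the kernel names `ElementForces`/`GatherForces` and the arrays `elem_force`/`node_slots` are made up for illustration, not from this thread). Pass one scatters each element's four nodal contributions into private slots, so no two threads ever write the same address; pass two sums, per node, the slots that touch it:

```cuda
// Pass 1: one thread per element; each element writes its four nodal
// contributions into its own private slots, so there are no collisions.
__global__ void ElementForces(float* elem_force, int num_elems)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= num_elems) return;
    for (int n = 0; n < 4; n++) {
        float contribution = 0.0f;   // the real per-node force goes here
        elem_force[e * 4 + n] = contribution;
    }
}

// Pass 2: one thread per node; gather the slots touching this node.
// node_slots[node * max_valence + j] indexes into elem_force, padded
// with -1 when the node touches fewer than max_valence elements.
__global__ void GatherForces(const float* elem_force, const int* node_slots,
                             float* force, int num_nodes, int max_valence)
{
    int node = blockIdx.x * blockDim.x + threadIdx.x;
    if (node >= num_nodes) return;
    float sum = 0.0f;
    for (int j = 0; j < max_valence; j++) {
        int slot = node_slots[node * max_valence + j];
        if (slot >= 0)
            sum += elem_force[slot];
    }
    force[node] = sum;
}
```

The scratch array costs 4*NumElems floats and node_slots is a precomputed node-to-slot table, but neither kernel needs any atomicity at all.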

I tried. Here is my code:

__global__ void Kernel1(float Mu, float Lambda, float* Force_array, int sizeArray)

{

    /// Structure array for simulating atomic operations

    __device__ struct __align__8 {int tag; float acc;} force;

...

But the compiler didn’t like that:

Adding free(force) at the end of the kernel doesn't work either: no additional error, but it didn't solve my problem. And if I remove the __device__ keyword I get this error instead:

Any help would be appreciated…

I think it's the __align__8 that should be __align__(8).
/Pyry

Nope… The parentheses didn't change anything… :(

Sorry for answering twice in a row, but I've just noticed something. The problem seems to be caused by the dynamic declaration. Indeed, if I try, let's say:

struct __align__(8) {int tag; float acc;} force[3000];

The compiler no longer complains. I don't understand why, but it seems to be the solution.

After you have finished your coding you need to check that you have a vector store in your ptx code (the guide does not mention that 64/128-bit stores exist at all). If you offset from a pointer you should get something like:
st.global.v2.u32 [$r8+8], {$r10,$r12};
There is no guarantee that ptxas won't mangle it… (which is why confirmation from Nvidia is required)
Eric

Should the code look like this?

Initialisation:

struct __align__(8) accumulator {int tag; float acc;};

accumulator * force = 0;

cudaMalloc((void**)&force, 4*NumNodes*sizeof(accumulator));

cudaMemset(force, 0, 4*NumNodes*sizeof(accumulator));

Inside my kernel:

        do

         {

             tag = force[NodeNumber].tag;

             acc = force[NodeNumber].acc;

             __syncthreads();

             tag = tid;

             acc += F;

             force[NodeNumber].tag = tag;

             force[NodeNumber].acc = acc;

             __syncthreads();

         } 

         while ( force[NodeNumber].tag != tid );

within a loop over the maximum number of collisions I can get. Maybe I misunderstood, but I would say it won't fix my problem, because two writes happen: the tag and the value. I check the tag, assuming the tag will change if and only if the value has changed as well. But I believe the behaviour of writing the tag is undefined, and writing the value is undefined too, and I have no guarantee they will behave the same way: the write might succeed for the tag but not for the value, for instance. And if I write the structure directly it would still be the same, because operator= would split it into separate writes as well. Or am I wrong?

I think I need to encode the thread ID into the value I try to write somehow. But I can't use the trick used for integers in the histogram method in another topic on this forum, because I'm adding numbers: I don't just want to know which thread has written. In my case several threads might manage to write, so I would need the list of all threads that succeeded, in order to know which threads need to retry because they failed. I'm starting to think it's not possible.

I didn't understand. What is that line: st.global.v2.u32 [$r8+8], {$r10,$r12}; ? I've just read about ptx code somewhere else on this forum but I don't really know what it is. I need more details on that one :)

OK, I was hoping not to have to write code:

__global__ void

test(int i, float f)

{

    const int   tid = threadIdx.x;

    float2      a;

    float2*     p;

    int         done = 0;

   p = (float2*)force + i;

    a = p[0];

    __syncthreads();

    a.x = tid;

    a.y += f;

    p[0] = a;

    //------------------------------------ repeat this the required number of times

    __syncthreads();

    a = p[0];

    if (a.x == tid) done = 1;

    __syncthreads();

    if (!done)

    {

        a.x = tid;

        a.y += f;

        p[0] = a;

    }

    //------------------------------------

}

compiles to (using nvcc -ptx) in part:

#   5  __global__ void

 #   6  test(int i, float f)

$LBB1_test:

 #      .loc    2       14      0

 #  10      float2*     p;

 #  11      int         done = 0;

 #  12

 #  13      p = (float2*)force + i;

 #  14      a = p[0];

        ld.param.u32    $r1, %parm_i;           #  id:46 %parm_i+0x0

        mul.lo.u32      $r2, $r1, 8;            #

        mov.u32         $r3, (&force);          #

        add.u32         $r4, $r2, $r3;          #

        ld.global.v2.f32        {$f4,$f1}, [$r4+0];     #

 #      .loc    2       15      0

 #  15      __syncthreads();

        bar.wait        0;                      #

 #      .loc    2       17      0

 #  16      a.x = tid;

 #  17      a.y += f;

        ld.param.f32    $f2, %parm_f;           #  id:48 %parm_f+0x0

        add.f32         $f1, $f1, $f2;          #

 #      .loc    2       18      0

 #  18      p[0] = a;

        cvt.u32.u16     $r5, %tid.x;            #

        cvt.rn.f32.s32  $f3, $r5;       #

        st.global.v2.f32        [$r4+0], {$f3,$f1};     # 64 bit write

 #      .loc    2       20      0

 #  19      //------------------------------------

 #  20      __syncthreads();

        bar.wait        0;                      #

 #      .loc    2       21      0

 #  21      a = p[0];

        ld.global.f32   $f1, [$r4+4];   #  id:53 force+0x0

 #      .loc    2       23      0

 #  22      if (a.x == tid) done = 1;

 #  23      __syncthreads();

        bar.wait        0;                      #

        setp.eq.f32     $p1, $f3, $f4;          #

        @$p1 bra        $Lt_0_5;                #

$LBB2_test:

 #      .loc    2       27      0

 #  24      if (!done)

 #  25      {

 #  26          a.x = tid;

 #  27          a.y += f;

        add.f32         $f1, $f1, $f2;          #

        st.global.v2.f32        [$r4+0], {$f3,$f1};     # 64 bit write

$Lt_0_5:

 #      .loc    2       28      0

 #  28          p[0] = a;

        exit;                           #

I did have a go with the accumulator struct but it would not work?? float2 is just as good. As I mentioned earlier, you can't put a __syncthreads() inside a possibly divergent while loop.

Eric

ed: just had a confusing value in the example

ed: and just for comparison this is 0.9 tools:

      .entry test

        {

        .reg .u32 $r1,$r2,$r3,$r4,$r5;

        .reg .f32 $f1,$f2,$f3,$f4;

        .reg .pred $p0,$p1;

        .param .s32 __cudaparm_i;

        .param .f32 __cudaparm_f;

        .loc    2       6       0

$LBB1_test:

        .loc    2       14      0

        ld.param.u32    $r1, [__cudaparm_i];    //  id:46 __cudaparm_i+0x0

        mul.lo.u32      $r2, $r1, 8;            //

        mov.u32         $r3, force;             //

        add.u32         $r4, $r2, $r3;          //

        ld.global.v2.f32        {$f4,$f1}, [$r4+0];     //

        .loc    2       15      0

        bar.wait        0;                      //

        .loc    2       17      0

        ld.param.f32    $f2, [__cudaparm_f];    //  id:48 __cudaparm_f+0x0

        add.f32         $f1, $f1, $f2;          //

        .loc    2       18      0

        cvt.s32.u16     $r5, %tid.x;            //

        cvt.rn.f32.s32  $f3, $r5;       //

        st.global.v2.f32        [$r4+0], {$f3,$f1};     //

        .loc    2       20      0

        bar.wait        0;                      //

        .loc    2       21      0

        ld.global.f32   $f1, [$r4+4];   //  id:53 force+0x0

        .loc    2       23      0

        bar.wait        0;                      //

        setp.eq.f32     $p1, $f3, $f4;          //

        @$p1 bra        $Lt_0_5;                //

        .loc    2       27      0

        add.f32         $f1, $f1, $f2;          //

        st.global.v2.f32        [$r4+0], {$f3,$f1};     //

$Lt_0_5:

        .loc    2       28      0

        exit;                           //

Thanks, I think I understand what you're saying now. But I still think it won't produce the correct result every time. What do you think of this?

Let’s say thread 1 does that using the method you explained:

a.x = 0;

a.y += f1;

Thread 2 does:

a.x = 1;

a.y += f2;

According to the programming guide we know that:

If these 2 threads write to the same location in memory, we have 3 possible results: the global force is either f1, f2, or f1+f2. It can't be neither, since one write is guaranteed to succeed. Your method allows me to detect the first 2 cases. And if both writes happen, thanks to the tag you're going to detect the last thread that wrote, but you won't know that both did. So if you detect thread 1, for instance, thread 2 is going to write again, therefore adding its value twice.

Or is my thinking not correct?

Actually I made a mistake; this should work. We don't care if there are several writes: the last one will have worked, and at the end of the loop every number will have been added. I'm implementing it now. I'll let you know.

Actually I think you are snookered here by compiler bugs. If you look at the business code:

       bar.wait        0;                      //

        .loc    12      19      0

        ld.global.f32   $f1, [$r4+4];   //  id:53 force+0x0

        .loc    12      21      0

        bar.wait        0;                      //

        setp.eq.f32     $p1, $f3, $f4;          //

        @$p1 bra        $Lt_0_5;                //

        .loc    12      25      0

        add.f32         $f1, $f1, $f2;          //

        st.global.v2.f32        [$r4+0], {$f3,$f1};     //

$Lt_0_5:

Then you will notice that the compiler has optimised away re-reading p[0].x between the two __syncthreads() (bar.wait 0), so you never get to find out who wrote last. The compiler should always assume shared and global memory may have changed across a __syncthreads(), and therefore cannot keep a copy in registers. If you try to persuade it by declaring "volatile float2* p" you get this false error:

"test6.cu", line 12: error: no operator "=" matches these operands

            operand types are: float2 = volatile float2

      a = p[0];

        ^

So that's 2 compiler bugs (same on 0.8 and 0.9). Any feedback for us, Nvidia? (also on whether 64-bit writes are preserved all the way through to the hardware)

Thanks, Eric

Actually it seems to work for me. Here is my code:

   float2 temp;

    

    for (int i=0; i<3; i++)

    {

        int done = 0;

        

        temp = force[4*NodeNumber+i];

        __syncthreads();

        temp.x = tid;

        temp.y += F[i];

        force[4*NodeNumber+i] = temp;

        

        for (int q=0; q < VALENCE; q++)

        {        

            __syncthreads();

            temp = force[4*NodeNumber+i];

            // Check if the write succeeded

            if ( force[4*NodeNumber+i].x == tid )

            {

                done = 1;

            }

            __syncthreads();

            // If not, write again

            if ( !done )

            {

                temp.x = tid;

                temp.y += F[i];

                force[4*NodeNumber+i] = temp;

            }

        }

    }

The for loop over i computes each component of my force vector. My force array is declared like this:

   float2* force = 0;

    cudaMalloc((void**)&force, 4*NumNodes*sizeof(float2));   

    cudaMemset(force, 0, 4*NumNodes*sizeof(float2));

My algorithm seems to work, and seems stable as far as I can tell. It's slow so far because my kernel uses 38 registers, so I'm trying to decrease that number. But the result seems correct. The compiler doesn't do the optimisation you got:

# 287      for (int i=0; i<3; i++)

 # 288      {

 # 289          int done = 0;

        mov.s16         $rh1, 0;                #

        mov.s32         $r88, 0;                #

 #      .loc    10      292     0

 # 290

 # 291

 # 292          temp = force[4*NodeNumber+i];

        ld.global.f32   $f398, [$r87+4];        #  id:1345

 #      .loc    10      293     0

 # 293          __syncthreads();

        bar.wait        0;                      #

 #      .loc    10      295     0

 # 294          temp.x = tid;

 # 295          temp.y += F[i];

        ld.local.f32    $f399, [$r83+0];        #  id:1346 F$0+0x0

        add.f32         $f398, $f399, $f398;    #

        st.global.v2.f32        [$r87+0], {$f397,$f398};        #

 #      .loc    10      299     0

 # 296          force[4*NodeNumber+i] = temp;

 # 297

 # 298

 # 299          for (int q=0; q < VALENCE; q++)

        mov.s16         $rh2, 0;                #

        mov.s32         $r89, 0;                #

$Lt_0_50:

 #<loop> Loop body line 299, nesting depth: 2, iterations: 8

 #      .loc    10      301     0

 # 300          {

 # 301              __syncthreads();

        bar.wait        0;                      #

        ld.global.v2.f32        {$f400,$f398}, [$r87+0];        #

 #      .loc    10      304     0

 # 302              temp = force[4*NodeNumber+i];

 # 303              // Check if the write succeed

 # 304              if ( force[4*NodeNumber+i].x == tid )

        mov.s32         $r90, 1;                #

        setp.eq.f32     $p2, $f400, $f397;      #

        selp.s32        $r88, $r90, $r88, $p2;  #

 #      .loc    10      308     0

 # 305              {

 # 306                  done = 1;

 # 307              }

 # 308              __syncthreads();

        bar.wait        0;                      #

        mov.s32         $r91, 0;                #

        setp.ne.s32     $p3, $r88, $r91;        #

        @$p3 bra        $Lt_0_51;               #

$LBB15_Kernel1:

 #<loop> Part of loop body line 299, head labeled $Lt_0_50

 #      .loc    10      313     0

 # 309              // If not, write again

 # 310              if ( !done )

 # 311              {

 # 312                  temp.x = tid;

# 313                  temp.y += F[i];

        add.f32         $f398, $f399, $f398;    #

        st.global.v2.f32        [$r87+0], {$f397,$f398};        #

$Lt_0_51:

 #<loop> Part of loop body line 299, head labeled $Lt_0_50

 #      .loc    10      299     0

        add.s32         $r89, $r89, 1;          #

        mov.s32         $r92, 8;                #

        setp.ne.s32     $p4, $r89, $r92;        #

        @$p4 bra        $Lt_0_50;               #

$LBB17_Kernel1:

 #<loop> Part of loop body line 146, head labeled $Lt_0_47

 #      .loc    10      287     0

        add.u32         $r87, $r87, 8;          #

        add.u32         $r83, $r83, 4;          #

        setp.ne.s32     $p5, $r83, $r84;        #

        @$p5 bra        $Lt_0_47;               #

$LBB18_Kernel1:

        mul.lo.s32      $r93, $r39, 4;          #

        mov.u32         $r94, (&F$0);           #

        mul.lo.u32      $r95, $r93, 8;          #

        add.u32         $r96, $r95, $r86;       #

$Lt_0_57:

 #<loop> Loop body line 287, nesting depth: 1, iterations: 3

 #      .loc    10      323     0

I believe that at line 301 the instruction ld.global.v2.f32 {$f400,$f398}, [$r87+0]; actually reads the value from memory between the 2 __syncthreads(). Is my understanding correct?

Hey, you're lucky! Comparing against the global location and not temp.x got you the right code! In 0.9, just putting the for loop in also fixes the problem… my example is still broken.

Confirmation on 64 bit writes?

Eric

Hi,

This must be because the default assignment operator ("=") takes an argument of "const T &" type, and a volatile object cannot be treated as const non-volatile, only as volatile or const-volatile. CUDA's float2 is a "C" struct and hence there are no operators in its definition.

I think there are two workarounds for this:

1) A C++-style solution: make a float2 class or struct of your own. Then you can define multiple assignment operators like:

#define MY_DECL __device__ __host__ __inline__

struct my_float2 {

    float x, y;

    MY_DECL my_float2 & operator=(const my_float2 & r) {

        x = r.x; y = r.y; return *this;

    }

    MY_DECL my_float2 & operator=(const volatile my_float2 & r) {

        x = r.x; y = r.y; return *this;

    }

    // ...

};

#undef MY_DECL

2) Use function make_float2 for your volatile object:

float2 tmp = make_float2(a_volatile_float2.x, a_volatile_float2.y);

/Pyry

edit: Added MY_DECL to make it work in device code.

No, actually even with temp.x my code is still working fine… And I'm using 0.8, for information.

Thanks, Pyry. I did not stop to think; I was a bit frustrated by the time I got to that point… the only place one should need volatile is when using SIMD within a warp (as you pointed out previously).

Morp208: I think if you check, either referencing the global location OR putting in the for loop corrects the problem (that was what I found on 0.9). You should really design your algorithm not to require this code! It is not safe to use unless Nvidia gives the OK, as there is too much hidden between the ptx and the hardware. I thought it should work, and it was an interesting extension of the histogram example. The new atomic ops should be used when you can, but they don't do float adds, so this code may still have some application.
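For what it's worth, on hardware that does support the new atomic ops, a float add can be emulated on top of the integer atomicCAS with a retry loop. This is a generic sketch, not something from this thread, and it assumes a device with 32-bit global atomics:

```cuda
// A float += emulated with the integer compare-and-swap atomic.
// atomicCAS returns the word it found at *iaddr; if another thread
// changed the value between our read and our CAS, the comparison fails
// and we loop and try again with the freshly observed value.
__device__ float atomicAddFloat(float* addr, float val)
{
    int* iaddr = (int*)addr;
    int old = *iaddr;
    int assumed;
    do {
        assumed = old;
        old = atomicCAS(iaddr, assumed,
                        __float_as_int(__int_as_float(assumed) + val));
    } while (assumed != old);
    return __int_as_float(old);   // the value before our add
}
```

Unlike the tagged-write scheme it never loses an addition, because the CAS only commits when nobody else wrote in between.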

Can't you organise the summing so that each element sums into a different relative node at the same time? That way all you need is some syncs.

Eric

I thought about that, but I don't think it's really feasible. It would mean organising my threads within each warp so as not to have any collision. My algorithm is a finite element solver, and it's really not trivial to arrange everything so that no two elements within a warp share nodes. It depends on my mesh, and if I increase the number of elements I will have more elements sharing nodes as well. So figuring out a way to do that properly is really not an easy problem. I have already thought about it, and it's the last step before changing my algorithm.
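One standard way to make that scheduling tractable is a one-off greedy colouring of the elements on the host: any two elements sharing a node get different colours, and the solver then processes one colour at a time (one kernel launch per colour, or one loop iteration with syncs), so no two concurrently processed elements touch the same node. A sketch, assuming four-node elements; the name `ColourElements` and the `elem_nodes` layout are hypothetical:

```cpp
#include <vector>

// Greedy element colouring, run once at setup on the host.
// elem_nodes[4*e + i] is the i-th node index of element e.
// On return colour[e] is element e's colour; the function returns the
// number of colours used (bounded by node valence, not by mesh size).
int ColourElements(const int* elem_nodes, int num_elems, int num_nodes,
                   int* colour)
{
    std::vector<unsigned> used(num_nodes, 0);   // colour bitmask per node
    int num_colours = 0;
    for (int e = 0; e < num_elems; e++) {
        unsigned mask = 0;
        for (int i = 0; i < 4; i++)
            mask |= used[elem_nodes[4 * e + i]];
        int c = 0;
        while (mask & (1u << c)) c++;           // smallest free colour
        colour[e] = c;
        for (int i = 0; i < 4; i++)
            used[elem_nodes[4 * e + i]] |= 1u << c;
        if (c + 1 > num_colours) num_colours = c + 1;
    }
    return num_colours;
}
```

Within one colour a plain, non-atomic += into force[] is then safe, at the cost of several smaller launches instead of one big one.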

But perhaps Nvidia can confirm everything that has been said in this topic?