Writes in same memory location Cant add numbers from different threads?

e.ping · June 25, 2007, 3:48pm

Morph208:

I am facing the same problem like you are. I am trying to implement the method suggested. I had one question regarding your two for loops. I understand the topmost one (i loop). But couldn’t get the second one (q loop). Is this the number of attempts that you make to try to update the force array in the thread? If so, what is the value of the VALENCY did you use to make it work ?

Thanks,

ST

Morph208 · June 25, 2007, 11:34pm

Hi Shantha,

You’ve well understood, the second loop is to make sure the update occurs. The valence must match to your worse case. Sadly you can’t make a loop, check if the write occurs and do two different things afterwards (write or not write) and thus stopping to try to write if it’s already worked. Because with a such loop you’re gonna cause a divergence. And since before writing you need to synchronize (the execution is suspended till all the threads reach the synchronization point). And because of the divergence the threads which managed to write won’t be able to reach the synchronization point and this will crash your computer.
Instead you have to try writing as many times as your worse case. What is your maximum of collision you can get? This is this number you must choose as valence. I’ve called it valence because in my case I’m working on a finite element solver and this number matches to the valence of my mesh (the maximum number of elements with which a node can be bound).

Although it seems to produce a correct result, be aware that this is far from being efficient and thus this is very slow. At least with cuda 0.8 which has some bugs with global memory. You would probably be more efficient if you re-organise your algorithm in a different fashion.

vpodlozhnyuk · June 26, 2007, 1:52am

   float2 temp;

    

    for (int i=0; i<3; i++)

    {

        int done = 0;

        

        temp = force[4*NodeNumber+i];

        __syncthreads();

        temp.x = tid;

        temp.y += F[i];

        force[4*NodeNumber+i] = temp;

        

        for (int q=0; q < VALENCE; q++)

        {        

            __syncthreads();

            temp = force[4*NodeNumber+i];

            // Check if the write succeed

            if ( force[4*NodeNumber+i].x == tid )

            {

                done = 1;

            }

            __syncthreads();

            // If not, write again

            if ( !done )

            {

                temp.x = tid;

                temp.y += F[i];

                force[4*NodeNumber+i] = temp;

            }

        }

    }

The code is very unlikely to work for arbitrary case, since shared memory update is non-atomic (subdivided into many instructions), and the only thing we are guaranteed about scheduling is each thread within a warp always executes the same instruction in SIMD fashion, but different warps execute different instructions until __syncthreads(). So if collisions stay within warps, tagging data and waiting until shared memory write succeeds is guaranteed to work, but collisions between warps can’t be reliably resolved by such approach. Have you tried “worst case” input data (giving maximum possible level of collisions)? Also each thread should not just calculate new shared memory value once and wait until the write succeeds, but respin the increment after write failure, basing on new shared memory value, written by the last “lucky” thread colliding to the same node.

Morph208 · June 26, 2007, 3:53am

First of all this code has been done using global memory. This doesn’t change anything about the non-ability to perform atomic operations. But this means the code works within a block and not within a warp. So you need to gather all collisions within blocks. Since the order of blocks running on the device is undefined, you can’t play with the order of the blocks to avoid collisions. But if your collision are gathered within blocks, I think this code’s working. No?

What do you mean by

? The loop for assuring writes is already done on the worse case (valence is the maximum number of collision I can get).

But I don’t need to

. In the beginning of the q loop, I read again the value from global memory before trying to increment it, specially to grab the latest value written by the “lucky” thread.

osiris1 · June 26, 2007, 5:21am

Yes Victor has his wires crossed - this code works as long as the pseudo op st.global.v2.f32 will always write 64 bits in an atomic operation. That is what needs to be confirmed by Nvidia, also a confirmation for st.global.v4 would be useful as well. We need to know that this will not be broken in future generations of the hardware though its requirement may disappear when atomic ops can do float operations (which has been promised somewhere here - not in compute 1.1).

Definitely does not work into shared mem where all writes are 32 bit. Atomic ops don’t work into shared mem either.

Morph208 - you will probably get a performance improvement by unrolling your loop - only needs VALENCE-1 times. I think backward branches are relatively slow here as there is no branch prediction. If you could do all the writes for each direction into a node separately then perhaps you might not need this (I have never done FEM).

Eric

Morph208 · June 26, 2007, 5:42am

Osiris> I’ve abandonned this approach. I came back to a gather implementation. This algorithm was way too slow for my application. But I’ve learned that 0.8 had bugs with global memory (I’ve measured my device-device memory bandwidth at only 9Gb/s under 0.8) and it might explain why it was so slow as well. So I might try again this algorithm with the 0.9 release if I can figure out a way to sort my elements by gathering all collisions within blocks. I don’t even know if it’s possible.

vpodlozhnyuk · June 26, 2007, 1:59pm

First of all this code has been done using global memory. This doesn’t change anything about the non-ability to perform atomic operations. But this means the code works within a block and not within a warp. So you need to gather all collisions within blocks. Since the order of blocks running on the device is undefined, you can’t play with the order of the blocks to avoid collisions. But if your collision are gathered within blocks, I think this code’s working. No?

What do you mean by ? The loop for assuring writes is already done on the worse case (valence is the maximum number of collision I can get).

But I don’t need to . In the beginning of the q loop, I read again the value from global memory before trying to increment it, specially to grab the latest value written by the “lucky” thread.

[snapback]214385[/snapback]

Oh, sorry, I was writing this late at night. On the second sight there’s nothing forbidden in the code, all __syncthreads() are there and respins are in fact done. Worst collision case is when all threads collide at the same node, thus requiring full VALENCE spin count until the update is finished (without collisions each thread does the only “spin”).

Yes, as long as align(8) option is supplied in type definition and total size of elements in the structure is 8 byte (temporary compiler limitation actually), as with built-in float2, single reads and writes are atomic. The same holds for 16-byte types like float4. 4, 8 and 16 byte global reads/writes are natively supported by G80.

But anyway because of data-dependent global reads and just because of this for(i = 0; i < 3; i++) loop your global memory reads are totally noncoalesced, and memory performance is extremly poor. Even coalesced global memory read costs hundreds of clocks per warp, the price paid for noncoalesced reads can be much higher(by a factor of 16 in worst case). Please refer to alignedTypes CUDA SDK 1.0 sample, demonstrating the difference between align’ed (coalesced) and non-align’ed (non-coalesced) global memory reads/writes. At the same time reads and writes to shared memory have the same price as local register move or any other primitive math operation(MAD, bitwise, …) in case of bank conflict-free access pattern: 4 clocks per warp. Even in the worst case of 16-way bank conflict shared reads and writes cost “only” 64 clocks per warp, which is still much faster than coalesced global read. I just didn’t expect someone is going to do such fuzzy data-dependent tricks in global memory, which are usually done in shared memory.

Maybe I’m still missing something, but you haven’t posted full kernel code or NodeNumber calculation at least, so I assume it’s just some value derived from input data.

As you mentioned on your own, it works (especially if CUDA results pass validation by “gold” CPU implementation), but far too slow to be of any use.

e.ping · June 26, 2007, 8:38pm

I have implemented this in slightly different way requiring lesser time. I still get variance in the values between emulation run and device run. But it is much better than before.

float2 temp;

for (int i=0; i<3; i++)
{
int done = 0;

   temp = force[4*NodeNumber+i];
   __syncthreads();
   temp.x = tid;
   temp.y += F[i];
   force[4*NodeNumber+i] = temp;
   
   for (int q=0; q < VALENCE && !done; q++)
   {        
       __syncthreads();
       temp = force[4*NodeNumber+i];
       // Check if the write succeed
       if ( force[4*NodeNumber+i].x == tid )
       {
           done = 1;
       }
       __syncthreads();
       // If not, write again
       if ( !done )
       {
           temp.x = tid;
           temp.y += F[i];
           force[4*NodeNumber+i] = temp;
       }
   }

}

Morph208 · June 26, 2007, 11:15pm

Yes you’re right, adding !done is better of course. But you can still observe some differences between emulation and device run because of the collisions you get between the blocks. And the only way to avoid this is sorting your data.

osiris1 · June 27, 2007, 1:49am

C’mon guys you can’t put a __syncthreads() inside a possibly divergent conditional. As mentioned above - so the &&!done is illegal.

I thought this should be the case too, which is why I offered it first, but on testing it does not work - if the types in the structure are different then vector ops are not generated - so I guess this is a compiler bug report (does not affect my work so will not post an official bug report):

typedef struct __align__(8){int x; float y;} Acc;

#define TYP     Acc

__device__ TYP force[3000];

__global__ void

kernel(int i, float f)

{

    const int   tid = threadIdx.x;

    TYP         a;

    TYP*        p;

    int         done = 0;

    int         j;

   p = force + i;

    a = p[0];

    __syncthreads();

    a.x = tid;

    a.y += f;

    p[0] = a;

    for (j = 0; j < 4; ++j)

    {

        __syncthreads();

        a = p[0];

        if (a.x == tid) done = 1;

        __syncthreads();

        if (!done)

        {

            a.x = tid;

            a.y += f;

            p[0] = a;

        }

    }

}

Compiles to (in 1.0, in part, just for loop):

$Lt_0_9:

 //<loop> Loop body line 20, nesting depth: 1, iterations: 4

        .loc    2       23      0

        bar.sync        0;                      //

        .loc    2       24      0

        ld.global.f32   $f1, [$rd4+4];  //  id:56 force+0x0

        .loc    2       25      0

        mov.s32         $r4, 1;                 //

        ld.global.s32   $r5, [$rd4+0];  //  id:57 force+0x0

        setp.eq.s32     $p1, $r5, $r2;          //

        selp.s32        $r3, $r4, $r3, $p1;     //

        .loc    2       26      0

        bar.sync        0;                      //

        mov.s32         $r6, 0;                 //

        setp.ne.s32     $p2, $r3, $r6;          //

        @$p2 bra        $Lt_0_10;               //

 //<loop> Part of loop body line 20, head labeled $Lt_0_9

        .loc    2       30      0

        add.f32         $f1, $f1, $f2;          //

        .loc    2       31      0

        st.global.s32   [$rd4+0], $r2;  //  id:58 force+0x0

        st.global.f32   [$rd4+4], $f1;  //  id:59 force+0x0

$Lt_0_10:

 //<loop> Part of loop body line 20, head labeled $Lt_0_9

        .loc    2       21      0

        add.s16         $rh1, $rh1, 1;          //

        mov.s16         $rh2, 4;                //

        setp.ne.s16     $p3, $rh1, $rh2;        //

        @$p3 bra        $Lt_0_9;                //

You will notice that the integer and float parts are fetched and stored separately. They are far enough apart that I doubt that ptxas is going to glue them back together - if it did I throw my hands up in horror! Change the “int” to “float” in the typedef and it works as expected.

Morph208 · June 27, 2007, 3:05am

Mmm. Interesting. I’ve used a float2 array when I implemented it so it worked. But it’s interesting you’ve noticed this. In fact we need a pretty specific case for making it work.

Osiris> And yes of course, you’re right about the divergence condition. I’ve explained this to Shantha, but I guess I wasn’t well awake this morning when I’ve written this.

vpodlozhnyuk> fair enough. I fully understand why you were not expecting use of global memory. I’m aware it’s bloody slow. So in order to answer you,

About coalesced writes
I’m afraid I can’t coalesced my writes. My application is a finite element solver. The algorithm computes some element variables. At the end of the kernel I get 4 contribution values by element from these variables, one for each node. I need to add these nodal values to a node size array. As an example, let’s consider the element 15 and let’s say its 4 nodes are the numbers 4, 12, 13 and 20. At the end of the computation of element 15, I need to add the computed values into force[4], force[12], force[13] and force[20]. And since elements can share a node, my problem is that several threads are gonna try to add their contribution for the shared node in the same location in the memory. Actually the NodeNumber comes from a texture fetching (I have a texture which contains the list of nodes for each element)
About shared memory
For writing the results like this, I need a node size array shared by all the threads. The shared memory is only shared within a block. Therefore I can’t use this. I need something which covers all my threads. And it’s my final results anyway. So I need to write the resultats in global memory to retrieve them later from the host. I can’t write the results of my kernel in shared memory.

So for all these reasons, I’ve abandonned this technique and I’m using 4 textures at the end of my kernel, one for each node. My writes are coalesced since I’m considering element size arrays. Then I’m using a second kernel on nodes which grabs for each nodes all the element contributions previously computed and adds them. So I need 2 kernels. I wanted to avoid this to gain some speed. But apparently the hardware limitations prevents me to do this so far.

I hope I’ve been clear with all my explanations.

osiris1 · June 27, 2007, 4:28am

Another question that is brought up with this technique (or anything that relies upon atomic 64 or 128 bit writes) is whether the emulator works. On the surface I would say maybe unemulatable as it depends upon the host CPU having uninterruptable writes and they being correctly connected up. 64 bit should work on 64 bit architectures OK, but 128 bits on IA32??? If not, another health warning missing from the emulation section in the guide.

On the compiler problem above, I believe one could use st.global.v2.b32 for a mixed type. If not, perhaps PTX is too restrictive.

Eric

rgb1380 · June 27, 2007, 8:33am

Could some please explain why thread synchronization is required for this method? It seems similar to the histogram example and nVidia haven’t used a sync method for that code.

Also, it seems that threads from different blocks can write in the same global mem location and if so, should you not use a global id for the thread based on thread id and block id?

thanks.

vpodlozhnyuk · June 27, 2007, 8:44am

Speaking informally, such tricks with colliding writes are not considered to be best programming practice. :) Especially if they don’t pay off and kill performance.

There exist native global memory atomic integer ops, exposed by atomic*() group of device calls, which are guaranteed to do what you want . But unfortunately they exist only on GeForce8600 and 8500 and anyway coalescing constraints, required for decent performance remain the same. Without coalescing (which in fact excludes writing to data-depend addresses) your global memory performance can be extermly poor. In some cases, as in Morph208’s, the problem can be reformulated to replace heavy scatter can by gather, and textures can help with gather performance, since they are cached and do not impose too heavy restrictions on access patterns, required for best performance, just 1D/2D data locality.

As for vector loads/stores not generated for aligned mixed-type structures, it looks like a bug. Thanks for bringing it out.

Morph208 · June 27, 2007, 9:14am

In the histogram example, they compute a histogram per warp. So you don’t need to synchronize since every threads with the warp are executed concurrently. Look:

They can do this because the array used for storing the count is small. In my case, the size of the output array is way too big (can go up to several 10 000s). Therefore I can use one array this big per warp…

And I’m using a global id in my code. But at the end I need to scatter the computed values and this doesn’t rely on the global id (which is the element number in my case).

Morph208 · June 27, 2007, 9:25am

Actually even the gather implementation is slow so far. In order to compute my forces, my algorithm needs quite a lot of data and therefore I’m using quite a lot of texture fetching. But this seems to use a lot of registers as well (39 for the first kernel, 22 for the second) and I’ve tried all the tricks I could find on the forum. Therefore the occupancy is 25% only and I can’t achieve good performance. The CUDA implementation is 5 times slower than the previous implementation in Cg I got…

e.ping · June 27, 2007, 2:29pm

I agree with osiris that my !done condition inside the divergence would be creating problem. I didn’t realise that earlier. I will try the struct align(8) approach and see if it works for me. If not I need to think of a gather technique.
I cannot do a sort like morph208 suggested, since my application deals with a listmode file containing >10million events. I need to process each event and write to a huge array of > 100000 elements. So using the shared memory is out of the question.I am doing this in multiple passes with each pass processing 100000 events.

-ST

e.ping · June 27, 2007, 2:29pm

I agree with osiris that my !done condition inside the divergence would be creating problem. I didn’t realise that earlier. I will try the struct align(8) approach and see if it works for me. If not I need to think of a gather technique.
I cannot do a sort like morph208 suggested, since my application deals with a listmode file containing >10million events. I need to process each event and randomly write to a huge array of > 100000 elements. So using the shared memory is out of the question.I am doing this in multiple passes with each pass processing 100000 events.

-ST

vpodlozhnyuk · June 28, 2007, 6:30am

Try setting -po maxrregcount=N nvcc parameter. At the very least you should be able to get the same performance in CUDA as in Cg, since the amount of hardware resources remains the same, with the exception of shared memory and other features, exclusive to CUDA and not visible to graphics.

vpodlozhnyuk · June 28, 2007, 6:47am

Eric, meanwhile to bypass the trouble you can define Acc.y as int, and during computations do int-to-float bitwise reinterpret cast with dedicated pseudofunction call. Of course it will slightly mess up your high-level code, but won’t cost anything in terms of performance. In general, the hardware doesn’t make any difference between data types when fetching data from global memory or textures.

Topic		Replies	Views
Problem with correct branching within a warp CUDA Programming and Performance	23	15644	May 28, 2009
Shared mem atomics Repeat topic CUDA Programming and Performance	47	8781	December 1, 2009
CUDA Memory Consistency CUDA Programming and Performance	23	55577	March 8, 2007
Please help with __shared__ memory different usage than in samples CUDA Programming and Performance	30	3313	January 10, 2010
Cooperative Groups: Flexible CUDA Thread Programming Technical Blog	32	12481	February 7, 2023
atomicCAS does NOT seem to work Hardware Bug? or Improper use?? TESLA C1060 CUDA Programming and Performance	70	19759	January 21, 2010
pointer to shared memory compiler problems CUDA Programming and Performance	19	14575	June 7, 2008
Program works on GTX 260 but not on GTX 480 CUDA Programming and Performance	34	3267	November 24, 2010
Global thread barrier CUDA Programming and Performance	78	85686	December 23, 2011
coalescing struct loading problem CUDA Programming and Performance	21	12725	March 5, 2010

Writes in same memory location Cant add numbers from different threads?

Related topics