Writes in same memory location Cant add numbers from different threads?

Morph208:

I am facing the same problem like you are. I am trying to implement the method suggested. I had one question regarding your two for loops. I understand the topmost one (i loop). But couldn’t get the second one (q loop). Is this the number of attempts that you make to try to update the force array in the thread? If so, what is the value of the VALENCY did you use to make it work ?

Thanks,

  • ST

Hi Shantha,

You’ve well understood, the second loop is to make sure the update occurs. The valence must match to your worse case. Sadly you can’t make a loop, check if the write occurs and do two different things afterwards (write or not write) and thus stopping to try to write if it’s already worked. Because with a such loop you’re gonna cause a divergence. And since before writing you need to synchronize (the execution is suspended till all the threads reach the synchronization point). And because of the divergence the threads which managed to write won’t be able to reach the synchronization point and this will crash your computer.
Instead you have to try writing as many times as your worse case. What is your maximum of collision you can get? This is this number you must choose as valence. I’ve called it valence because in my case I’m working on a finite element solver and this number matches to the valence of my mesh (the maximum number of elements with which a node can be bound).

Although it seems to produce a correct result, be aware that this is far from being efficient and thus this is very slow. At least with cuda 0.8 which has some bugs with global memory. You would probably be more efficient if you re-organise your algorithm in a different fashion.

   float2 temp;

    

    for (int i=0; i<3; i++)

    {

        int done = 0;

        

        temp = force[4*NodeNumber+i];

        __syncthreads();

        temp.x = tid;

        temp.y += F[i];

        force[4*NodeNumber+i] = temp;

        

        for (int q=0; q < VALENCE; q++)

        {        

            __syncthreads();

            temp = force[4*NodeNumber+i];

            // Check if the write succeed

            if ( force[4*NodeNumber+i].x == tid )

            {

                done = 1;

            }

            __syncthreads();

            // If not, write again

            if ( !done )

            {

                temp.x = tid;

                temp.y += F[i];

                force[4*NodeNumber+i] = temp;

            }

        }

    }

The code is very unlikely to work for arbitrary case, since shared memory update is non-atomic (subdivided into many instructions), and the only thing we are guaranteed about scheduling is each thread within a warp always executes the same instruction in SIMD fashion, but different warps execute different instructions until __syncthreads(). So if collisions stay within warps, tagging data and waiting until shared memory write succeeds is guaranteed to work, but collisions between warps can’t be reliably resolved by such approach. Have you tried “worst case” input data (giving maximum possible level of collisions)? Also each thread should not just calculate new shared memory value once and wait until the write succeeds, but respin the increment after write failure, basing on new shared memory value, written by the last “lucky” thread colliding to the same node.

First of all this code has been done using global memory. This doesn’t change anything about the non-ability to perform atomic operations. But this means the code works within a block and not within a warp. So you need to gather all collisions within blocks. Since the order of blocks running on the device is undefined, you can’t play with the order of the blocks to avoid collisions. But if your collision are gathered within blocks, I think this code’s working. No?

What do you mean by

? The loop for assuring writes is already done on the worse case (valence is the maximum number of collision I can get).

But I don’t need to

. In the beginning of the q loop, I read again the value from global memory before trying to increment it, specially to grab the latest value written by the “lucky” thread.

Yes Victor has his wires crossed - this code works as long as the pseudo op st.global.v2.f32 will always write 64 bits in an atomic operation. That is what needs to be confirmed by Nvidia, also a confirmation for st.global.v4 would be useful as well. We need to know that this will not be broken in future generations of the hardware though its requirement may disappear when atomic ops can do float operations (which has been promised somewhere here - not in compute 1.1).

Definitely does not work into shared mem where all writes are 32 bit. Atomic ops don’t work into shared mem either.

Morph208 - you will probably get a performance improvement by unrolling your loop - only needs VALENCE-1 times. I think backward branches are relatively slow here as there is no branch prediction. If you could do all the writes for each direction into a node separately then perhaps you might not need this (I have never done FEM).

Eric

Osiris> I’ve abandonned this approach. I came back to a gather implementation. This algorithm was way too slow for my application. But I’ve learned that 0.8 had bugs with global memory (I’ve measured my device-device memory bandwidth at only 9Gb/s under 0.8) and it might explain why it was so slow as well. So I might try again this algorithm with the 0.9 release if I can figure out a way to sort my elements by gathering all collisions within blocks. I don’t even know if it’s possible.

Oh, sorry, I was writing this late at night. On the second sight there’s nothing forbidden in the code, all __syncthreads() are there and respins are in fact done. Worst collision case is when all threads collide at the same node, thus requiring full VALENCE spin count until the update is finished (without collisions each thread does the only “spin”).

Yes, as long as align(8) option is supplied in type definition and total size of elements in the structure is 8 byte (temporary compiler limitation actually), as with built-in float2, single reads and writes are atomic. The same holds for 16-byte types like float4. 4, 8 and 16 byte global reads/writes are natively supported by G80.

But anyway because of data-dependent global reads and just because of this for(i = 0; i < 3; i++) loop your global memory reads are totally noncoalesced, and memory performance is extremly poor. Even coalesced global memory read costs hundreds of clocks per warp, the price paid for noncoalesced reads can be much higher(by a factor of 16 in worst case). Please refer to alignedTypes CUDA SDK 1.0 sample, demonstrating the difference between align’ed (coalesced) and non-align’ed (non-coalesced) global memory reads/writes. At the same time reads and writes to shared memory have the same price as local register move or any other primitive math operation(MAD, bitwise, …) in case of bank conflict-free access pattern: 4 clocks per warp. Even in the worst case of 16-way bank conflict shared reads and writes cost “only” 64 clocks per warp, which is still much faster than coalesced global read. I just didn’t expect someone is going to do such fuzzy data-dependent tricks in global memory, which are usually done in shared memory.

Maybe I’m still missing something, but you haven’t posted full kernel code or NodeNumber calculation at least, so I assume it’s just some value derived from input data.

As you mentioned on your own, it works (especially if CUDA results pass validation by “gold” CPU implementation), but far too slow to be of any use.

I have implemented this in slightly different way requiring lesser time. I still get variance in the values between emulation run and device run. But it is much better than before.

float2 temp;

for (int i=0; i<3; i++)
{
int done = 0;

   temp = force[4*NodeNumber+i];
   __syncthreads();
   temp.x = tid;
   temp.y += F[i];
   force[4*NodeNumber+i] = temp;
   
   for (int q=0; q < VALENCE && !done; q++)
   {        
       __syncthreads();
       temp = force[4*NodeNumber+i];
       // Check if the write succeed
       if ( force[4*NodeNumber+i].x == tid )
       {
           done = 1;
       }
       __syncthreads();
       // If not, write again
       if ( !done )
       {
           temp.x = tid;
           temp.y += F[i];
           force[4*NodeNumber+i] = temp;
       }
   }

}

Yes you’re right, adding !done is better of course. But you can still observe some differences between emulation and device run because of the collisions you get between the blocks. And the only way to avoid this is sorting your data.

C’mon guys you can’t put a __syncthreads() inside a possibly divergent conditional. As mentioned above - so the &&!done is illegal.

I thought this should be the case too, which is why I offered it first, but on testing it does not work - if the types in the structure are different then vector ops are not generated - so I guess this is a compiler bug report (does not affect my work so will not post an official bug report):

typedef struct __align__(8){int x; float y;} Acc;

#define TYP     Acc

__device__ TYP force[3000];

__global__ void

kernel(int i, float f)

{

    const int   tid = threadIdx.x;

    TYP         a;

    TYP*        p;

    int         done = 0;

    int         j;

   p = force + i;

    a = p[0];

    __syncthreads();

    a.x = tid;

    a.y += f;

    p[0] = a;

    for (j = 0; j < 4; ++j)

    {

        __syncthreads();

        a = p[0];

        if (a.x == tid) done = 1;

        __syncthreads();

        if (!done)

        {

            a.x = tid;

            a.y += f;

            p[0] = a;

        }

    }

}

Compiles to (in 1.0, in part, just for loop):

$Lt_0_9:

 //<loop> Loop body line 20, nesting depth: 1, iterations: 4

        .loc    2       23      0

        bar.sync        0;                      //

        .loc    2       24      0

        ld.global.f32   $f1, [$rd4+4];  //  id:56 force+0x0

        .loc    2       25      0

        mov.s32         $r4, 1;                 //

        ld.global.s32   $r5, [$rd4+0];  //  id:57 force+0x0

        setp.eq.s32     $p1, $r5, $r2;          //

        selp.s32        $r3, $r4, $r3, $p1;     //

        .loc    2       26      0

        bar.sync        0;                      //

        mov.s32         $r6, 0;                 //

        setp.ne.s32     $p2, $r3, $r6;          //

        @$p2 bra        $Lt_0_10;               //

 //<loop> Part of loop body line 20, head labeled $Lt_0_9

        .loc    2       30      0

        add.f32         $f1, $f1, $f2;          //

        .loc    2       31      0

        st.global.s32   [$rd4+0], $r2;  //  id:58 force+0x0

        st.global.f32   [$rd4+4], $f1;  //  id:59 force+0x0

$Lt_0_10:

 //<loop> Part of loop body line 20, head labeled $Lt_0_9

        .loc    2       21      0

        add.s16         $rh1, $rh1, 1;          //

        mov.s16         $rh2, 4;                //

        setp.ne.s16     $p3, $rh1, $rh2;        //

        @$p3 bra        $Lt_0_9;                //

You will notice that the integer and float parts are fetched and stored separately. They are far enough apart that I doubt that ptxas is going to glue them back together - if it did I throw my hands up in horror! Change the “int” to “float” in the typedef and it works as expected.

Mmm. Interesting. I’ve used a float2 array when I implemented it so it worked. But it’s interesting you’ve noticed this. In fact we need a pretty specific case for making it work.

Osiris> And yes of course, you’re right about the divergence condition. I’ve explained this to Shantha, but I guess I wasn’t well awake this morning when I’ve written this.

vpodlozhnyuk> fair enough. I fully understand why you were not expecting use of global memory. I’m aware it’s bloody slow. So in order to answer you,

  1. About coalesced writes
    I’m afraid I can’t coalesced my writes. My application is a finite element solver. The algorithm computes some element variables. At the end of the kernel I get 4 contribution values by element from these variables, one for each node. I need to add these nodal values to a node size array. As an example, let’s consider the element 15 and let’s say its 4 nodes are the numbers 4, 12, 13 and 20. At the end of the computation of element 15, I need to add the computed values into force[4], force[12], force[13] and force[20]. And since elements can share a node, my problem is that several threads are gonna try to add their contribution for the shared node in the same location in the memory. Actually the NodeNumber comes from a texture fetching (I have a texture which contains the list of nodes for each element)

  2. About shared memory
    For writing the results like this, I need a node size array shared by all the threads. The shared memory is only shared within a block. Therefore I can’t use this. I need something which covers all my threads. And it’s my final results anyway. So I need to write the resultats in global memory to retrieve them later from the host. I can’t write the results of my kernel in shared memory.

So for all these reasons, I’ve abandonned this technique and I’m using 4 textures at the end of my kernel, one for each node. My writes are coalesced since I’m considering element size arrays. Then I’m using a second kernel on nodes which grabs for each nodes all the element contributions previously computed and adds them. So I need 2 kernels. I wanted to avoid this to gain some speed. But apparently the hardware limitations prevents me to do this so far.

I hope I’ve been clear with all my explanations.

Another question that is brought up with this technique (or anything that relies upon atomic 64 or 128 bit writes) is whether the emulator works. On the surface I would say maybe unemulatable as it depends upon the host CPU having uninterruptable writes and they being correctly connected up. 64 bit should work on 64 bit architectures OK, but 128 bits on IA32??? If not, another health warning missing from the emulation section in the guide.

On the compiler problem above, I believe one could use st.global.v2.b32 for a mixed type. If not, perhaps PTX is too restrictive.

Eric

Could some please explain why thread synchronization is required for this method? It seems similar to the histogram example and nVidia haven’t used a sync method for that code.

Also, it seems that threads from different blocks can write in the same global mem location and if so, should you not use a global id for the thread based on thread id and block id?

thanks.

Speaking informally, such tricks with colliding writes are not considered to be best programming practice. :) Especially if they don’t pay off and kill performance.

There exist native global memory atomic integer ops, exposed by atomic*() group of device calls, which are guaranteed to do what you want . But unfortunately they exist only on GeForce8600 and 8500 and anyway coalescing constraints, required for decent performance remain the same. Without coalescing (which in fact excludes writing to data-depend addresses) your global memory performance can be extermly poor. In some cases, as in Morph208’s, the problem can be reformulated to replace heavy scatter can by gather, and textures can help with gather performance, since they are cached and do not impose too heavy restrictions on access patterns, required for best performance, just 1D/2D data locality.

As for vector loads/stores not generated for aligned mixed-type structures, it looks like a bug. Thanks for bringing it out.

In the histogram example, they compute a histogram per warp. So you don’t need to synchronize since every threads with the warp are executed concurrently. Look:

They can do this because the array used for storing the count is small. In my case, the size of the output array is way too big (can go up to several 10 000s). Therefore I can use one array this big per warp…

And I’m using a global id in my code. But at the end I need to scatter the computed values and this doesn’t rely on the global id (which is the element number in my case).

Actually even the gather implementation is slow so far. In order to compute my forces, my algorithm needs quite a lot of data and therefore I’m using quite a lot of texture fetching. But this seems to use a lot of registers as well (39 for the first kernel, 22 for the second) and I’ve tried all the tricks I could find on the forum. Therefore the occupancy is 25% only and I can’t achieve good performance. The CUDA implementation is 5 times slower than the previous implementation in Cg I got…

I agree with osiris that my !done condition inside the divergence would be creating problem. I didn’t realise that earlier. I will try the struct align(8) approach and see if it works for me. If not I need to think of a gather technique.
I cannot do a sort like morph208 suggested, since my application deals with a listmode file containing >10million events. I need to process each event and write to a huge array of > 100000 elements. So using the shared memory is out of the question.I am doing this in multiple passes with each pass processing 100000 events.

-ST

I agree with osiris that my !done condition inside the divergence would be creating problem. I didn’t realise that earlier. I will try the struct align(8) approach and see if it works for me. If not I need to think of a gather technique.
I cannot do a sort like morph208 suggested, since my application deals with a listmode file containing >10million events. I need to process each event and randomly write to a huge array of > 100000 elements. So using the shared memory is out of the question.I am doing this in multiple passes with each pass processing 100000 events.

-ST

Try setting -po maxrregcount=N nvcc parameter. At the very least you should be able to get the same performance in CUDA as in Cg, since the amount of hardware resources remains the same, with the exception of shared memory and other features, exclusive to CUDA and not visible to graphics.

Eric, meanwhile to bypass the trouble you can define Acc.y as int, and during computations do int-to-float bitwise reinterpret cast with dedicated pseudofunction call. Of course it will slightly mess up your high-level code, but won’t cost anything in terms of performance. In general, the hardware doesn’t make any difference between data types when fetching data from global memory or textures.