Good programming practice: writing shared & global memory

Is it regarded as bad practice to write the same shared memory location from all threads with exactly the same data? I read somewhere that the results of a concurrent write are undefined, but presumably the hardware gets it right if the data is identical. Are there any performance penalties (i.e. should one branch on tid around such writes)?
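
To make the question concrete, the two alternatives I am weighing look something like this (a minimal sketch; the names are illustrative only):

__global__ void unguarded(int *out)
{
    __shared__ int flag;
    flag = 1;                       // every thread writes the same value
    __syncthreads();
    out[threadIdx.x] = flag;
}

__global__ void guarded(int *out)
{
    __shared__ int flag;
    if (threadIdx.x == 0)           // branch on tid so only one thread writes
        flag = 1;
    __syncthreads();
    out[threadIdx.x] = flag;
}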

The situation is a little different for global writes, as there could be a significant penalty if one warp arrives at such a write after others have already completed it. Does the hardware write asynchronously to global memory (and how deep is the queue)? If so, then __syncthreads() would need to wait for all such writes to complete. Just trying to work out the performance tradeoffs.

Thanks, Eric

__syncthreads does not wait on global mem operations. It just blocks until all threads hit it. This is why the inter-block communication attempts some folks tried (see somewhere on this board) are doomed.
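
To illustrate (the flag name is made up), this sort of spin-wait can never be made reliable:

__device__ int ready;               // hypothetical global flag

__global__ void doomed(void)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
        ready = 1;
    __syncthreads();                // barrier within this block only: it does
                                    // not wait for the global write to land,
                                    // let alone for other blocks
    while (ready == 0)              // blocks scheduled before block 0's write
        ;                           // arrives (or before block 0 runs at all)
                                    // can spin here forever
}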

Peter

So Peter, you are definitely saying that global memory writes are asynchronous?

Back to the topic - what is good practice and why?

I would say good practice is for every thread to write to a different memory location at any point in time. So you should write to a shared variable from only one thread between two __syncthreads calls.

My reasoning is that, first, it is clearer if there is an (if tid==0) showing who is updating the resource, and second, you might get away with the concurrent writes now but the behavior may change in the future.
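
In code, the practice I mean is something like this (the value written is arbitrary):

__global__ void single_writer(int *out)
{
    __shared__ int result;
    if (threadIdx.x == 0)           // exactly one thread updates the resource
        result = 42;
    __syncthreads();                // barrier before anyone reads it
    out[threadIdx.x] = result;      // every thread now sees the same value
}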

Peter

Thanks Peter. I have been doing this to date but wanted to check as the architecture spec does not say so explicitly.

It says the write order is undefined but does not tell us whether these writes are serialised or not. Code is clearer without these branches all over the place, and if the spec said explicitly that on any given shared memory write cycle one result is written and the rest are ignored at the same time, then it would be safe, and faster, to run without them.

As I pointed out it is probably always advisable/necessary for global memory updates.

Nvidia, please confirm the specification situation and perhaps consider an update to the guide.

Thanks, Eric

Aside: on checking the assembler for one of these single-thread writes to shared memory, the code looks like all threads read the shared location, then all threads write the same location with the data just read, except for the selected thread, which writes the new value. Very odd. Anyone know how this can work?

Version 0.8.2 of the programming guide has clarified this point:

If the instruction executed by a warp writes to the same location in global or shared memory for more than one of the threads of the warp, how many writes occur to that location and the order in which they occur is undefined, but one of the writes is guaranteed to succeed.

Let me know if this doesn’t answer your question.

Cyril

Thanks Cyril, this just reinforces my statement that you should avoid having more than one thread write the same value, as you will get additional write latency if they all try to write to device mem.

Peter

Actually Peter, I think it implies that it is just fine to write the same thing from multiple threads (to shared memory), that this is the most efficient approach, and I think my code would look neater.

ed: the expression “how many writes occur” does not say whether there is a speed penalty for concurrent shared memory writes to the same location. I did presume that there is no such penalty, as that is the way the current compiler generates code. As I originally suggested, the situation is definitely different for global memory.
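
For example (a sketch that relies only on the guarantee quoted above; the names are illustrative), a block-wide flag needs no branch at all:

__global__ void any_match(const int *data, int key, int *result)
{
    __shared__ int found;
    found = 0;                      // all threads write the same 0; per the
                                    // guide, one of the writes succeeds
    __syncthreads();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (data[i] == key)
        found = 1;                  // possibly many threads write the same 1
    __syncthreads();
    if (threadIdx.x == 0)
        result[blockIdx.x] = found;
}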

Thanks, Eric

Cyril, could we please have a couple of words like “these writes are not serialised”, which would have the correct meaning for both device and shared memory?
Thanks, Eric

In my mind, if multiple writes occur for a single warp, they are serialized by definition, but we’ll make this clearer by saying “how many serialized writes occur to that location and the order in which they occur is undefined”.
So, as Peter says, you will have a perf penalty if more than one write occurs (and as usual, it will be worse for writes to global memory than to shared memory).

Sorry to be a pain Cyril, but are you sure you are correct? My offhand logic said the same: concurrent writes would look like bank conflicts at first glance. However, it seems that the current hardware does not work this way and that the compiler has been designed to take advantage of the fact. When I compile this program:

__shared__ int  j, k;

__global__ void kernel()
{
    if (threadIdx.x == 2) j = 1;
    k = threadIdx.x == 2 ? 1 : k;
}

I get the following code:

.entry kernel
        {
        .reg .u32 $r1,$r2,$r3,$r4,$r5,$r6,$r7,$r8,$r9;
        .reg .pred $p0,$p1,$p2;
 #      .loc    10      3       0
 #   1  __shared__ int  j, k;
 #   2
 #   3  __global__ void kernel()
$LBB1_kernel:
 #      .loc    10      5       0
 #   4  {
 #   5      if (threadIdx.x == 2) j = 1;
        cvt.u32.u16     $r1, %tid.x;            #
        mov.s32         $r2, 1;                 #
        ld.shared.s32   $r3, j;         #  id:9 j+0x0
        mov.u32         $r4, 2;                 #
        setp.eq.u32     $p1, $r1, $r4;          #
        selp.s32        $r5, $r2, $r3, $p1;     #
        st.shared.s32   j, $r5;         #  id:9 j+0x0
 #      .loc    10      7       0
 #   6
 #   7      k = threadIdx.x == 2 ? 1 : k;
        ld.shared.s32   $r6, k;         #  id:10 k+0x0
        mov.s32         $r7, 1;                 #
        mov.u32         $r8, 2;                 #
        setp.ne.u32     $p2, $r1, $r8;          #
        selp.s32        $r9, $r6, $r7, $p2;     #
        st.shared.s32   k, $r9;         #  id:10 k+0x0
        exit;                           #
        } # kernel
        .version 1.1

Careful examination of the two code fragments shows that they are identical, one writing shared variable j and the other k. The first writes j to 1 only in thread 2; the second writes k to 1 in thread 2 and to its original value in all other threads. Both fragments have exactly the same effect when run, so one can deduce that they are equivalent expressions: in warp 0 (which contains thread 2), a write always succeeds for a thread holding a value different from what is already stored in the target location (at least when there is only one such thread).

The important thing is that the write of j (the st.shared.s32 instruction) must be concurrent, not serialised; otherwise putting these branches around shared memory writes would NOT give you any performance benefit (the compiler emits the unconditional store either way) and there would always be a 16-way bank conflict, and likewise the write of k must be concurrent, not serialised. Why would you change the hardware spec in the future to something sloppier? All stores to shared memory not indexed by tid generating 16-way bank conflicts seems very unlikely. You would not design hardware to serialise writes when you know that only one is going to succeed in any case. Currently there must be some logic to select the one differing value to succeed.
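
In C terms, the PTX above amounts to the following read-select-write, executed unconditionally by every thread (my paraphrase of the generated code, not anything the compiler documents):

// what the compiler effectively emitted for "if (threadIdx.x == 2) j = 1;"
int old = j;                        // ld.shared.s32: every thread loads j
j = (threadIdx.x == 2) ? 1 : old;   // selp + st.shared.s32: thread 2 stores
                                    // the new value, the rest store back
                                    // what they just read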

Thanks, Eric

Cyril, it seems simply that, given one write must succeed correctly, the hardware has to select it and then just does it, as there is no point in doing the rest.

There is more than just saving 4 instructions and clearer source code here - it can also save you a sync, as the first warp to reach one of these writes will set the value and does not have to wait around for warp 0 (or whichever) to get there before using it.
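
For instance (an illustrative sketch; the variable and constant are made up), every warp writes the value itself and can use it immediately:

__global__ void broadcast_write(const int *in, int *out)
{
    __shared__ int scale;
    scale = 4;                      // every warp writes the identical value,
                                    // so whichever warp arrives first sets it
                                    // and none has to wait for warp 0
    out[threadIdx.x] = in[threadIdx.x] * scale;
}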

For device memory, the most one can say that is implementation independent is that you will get at most N writes, where N is the number of warps.

Thanks, Eric

Just a followup: why didn’t Cyril say that the code quoted above was a bug? It has been fixed in 1.0 and there are now branches around single-thread writes. My instruction timing topic measured 32 clocks for a shared mem write from all threads to one location with the same data. This might get fixed in newer hardware that allows address comparison at the input to the crossbar between registers and shared memory (an alternative to the broadcast mechanism for reads that would work for writes as well). Don’t suppose we will hear.

BTW you do get 2 writes if a warp writes to the same global memory location.

Eric