Suggestion for nvcc Overlaying auto shared

When I compile the following nonsense test program:

__shared__ int shared[8];

__device__ void share1(void)
{
    int            i = threadIdx.x;
    __shared__ int j[8];

    j[i]      = shared[i+1];
    shared[i] = j[i+1];
}

__device__ void share2(void)
{
    int            i = threadIdx.x;
    __shared__ int j[8];

    j[i]       = shared[i+1];
    shared[i] += j[i+1];
}

__global__ void kernel(void)
{
    share1();
    share2();
}

I get the following PTX (an extract):

        .shared .align 4 .b8 shared[32];
        .shared .align 4 .b8 _ZZ6share1vE1j[32];
        .shared .align 4 .b8 _ZZ6share2vE1j[32];

        ...

        .entry kernel
        {
        .reg .u32 $r1,$r2,$r3,$r4,$r5,$r6,$r7,$r8,$r9,
                $r10,$r11,$r12;
 #      .loc    10      21      0
 #  17      j[i] = shared[i+1];
 #  18      shared[i] += j[i+1];
 #  19  }
 #  20
 #  21  __global__ void kernel(void)
$LBB1_kernel:
 #      .loc    10      8       0
        cvt.u32.u16     $r1, %tid.x;            #
        mul.lo.u32      $r2, $r1, 4;            #
        mov.u32         $r3, (&shared);         #
        add.u32         $r4, $r2, $r3;          #
        mov.u32         $r5, (&_ZZ6share1vE1j); #
        add.u32         $r6, $r2, $r5;          #
        ld.shared.s32   $r7, [$r4+4];   #  id:26 shared+0x0
        st.shared.s32   [$r6+0], $r7;   #  id:27 _ZZ6share1vE1j+0x0
 #      .loc    10      9       0
        ld.shared.s32   $r8, [$r6+4];   #  id:28 _ZZ6share1vE1j+0x0
        st.shared.s32   [$r4+0], $r8;   #  id:29 shared+0x0
 #      .loc    10      17      0
        mov.u32         $r9, (&_ZZ6share2vE1j); #
        add.u32         $r10, $r2, $r9;         #
        st.shared.s32   [$r10+0], $r7;  #  id:30 _ZZ6share2vE1j+0x0
 #      .loc    10      18      0
        ld.shared.s32   $r11, [$r10+4]; #  id:31 _ZZ6share2vE1j+0x0
        add.s32         $r12, $r11, $r8;        #
        st.shared.s32   [$r4+0], $r12;  #  id:32 shared+0x0
 #      .loc    10      24      0
 #  22  {
 #  23      share1();
 #  24      share2();
        exit;                           #
        } # kernel
        .version 1.1

The question is: why are the shared segments _ZZ6share1vE1j and _ZZ6share2vE1j not overlaid? The compiler can easily work out which autos can be overlaid and save heaps of a precious resource. Sorry if there has already been a topic on this - I could not find it - but it does seem pretty basic! It blows my shared mem budget out of the water!
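In the meantime the only workaround I can see is to hoist the arrays to file scope and hand-overlay them myself - a sketch (safe only because neither function needs its j to survive past the call):

__shared__ int shared[8];
__shared__ int j_scratch[8];    // hand-overlaid: serves as j for both functions

__device__ void share1(void)
{
    int i = threadIdx.x;
    j_scratch[i] = shared[i+1];
    shared[i]    = j_scratch[i+1];
}

__device__ void share2(void)
{
    int i = threadIdx.x;
    j_scratch[i]  = shared[i+1];
    shared[i]    += j_scratch[i+1];
}

That halves the j storage here, but it throws away the scoping, which is the whole point of asking the compiler to do it.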

Thanks, Eric

I stumbled on this as well: it seems the shared memory areas of all kernels in a .cu file are added together, whether they are defined at function scope or globally. This potentially wastes a lot of the ‘precious resource’ and is actually stated in the manual as “shared memory has implicit static storage”. I didn’t find any way around it other than to use dynamic shared memory or to define separate kernels in separate .cu files.
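For reference, the dynamic route looks roughly like this (a sketch - the array is sized by the third launch parameter, not by the declaration):

extern __shared__ int smem[];   // one unsized extern declaration per file

__global__ void kernel(void)
{
    int i = threadIdx.x;
    smem[i] = i;                // later phases can reuse the same bytes
}

// host side: pass the byte count as the third launch parameter
// kernel<<<grid, block, 8 * sizeof(int)>>>();

It works, but all the bookkeeping of who uses which bytes when moves onto you.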

Yes, I read the spec and still expected it to be done the obvious way… having to use dynamic shared makes it messy and difficult to package up a subroutine library (actually a macro library).

Edit: it seems that at this level the same applies to registers, though there is an implication that further optimisation may happen below the .ptx level. Is there a register colouring pass to eliminate unnecessary registers on a global basis before kernel launch?

I think we’ve said elsewhere in this forum that this is a known bug and is fixed for the next release.

Mark

Unfortunately it seems not! Running the same test above in 0.9 still gives the same results. The bug with multiple kernels within the same cubin has definitely been fixed, but not the overlaying of auto shared. Any info on this one?
Thanks,
Eric

Yes, the bug with summing the shared mem of all kernels in a .cu file is fixed in 0.9.

In your example above, I don’t see how the compiler can figure out that it can overlay the two shared j arrays. They live in device functions which get inlined, and there is no __syncthreads between them, so access is unrestricted at runtime. I don’t think any compiler can do the semantic interpretation needed to remedy this (i.e. knowing what threadIdx.x means).

Peter

I did say a nonsense program! It was dashed out like that to ensure the compiler would not optimise anything away, without having to test it. The missing __syncthreads() is a red herring, as we are currently required to put them in wherever there is crosstalk between warps; it makes no difference here. If you call one of the functions twice there are not two copies of j in smem, and if there can be any interference it is up to you to put the __syncthreads() in. Inlining of functions is also a red herring, as the overlaying should also work between non-nested blocks within the same function.

The compiler does know about tids - at least Nvidia thinks it should: section 5.1.1.2 (in 0.9) says it does something different depending on whether it thinks the control variable in a conditional branch will cause divergence within a warp. Now the compiler does not actually do any of that, but they think it should! (Don’t base any design decision on that section.) This is not relevant here.

There are two options for implementation: in open64, where simple scope-based overlaying can be done (see the sketch below), or in ptxas, where a better full-colouring solution on shared mem segments could possibly be done - though that is rather tricky, as it has to keep track of all the pointers too.
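To be concrete about the scope-based case: in something like the following the two arrays are never live at the same time, so open64 could overlay a and b onto the same 32 bytes without tracking a single pointer (a sketch, with a made-up kernel name):

__global__ void phases(void)
{
    int i = threadIdx.x;
    {
        __shared__ int a[8];    // live only inside this block
        a[i] = i;
    }
    {
        __shared__ int b[8];    // a's block is closed - b could reuse its storage
        b[i] = 2 * i;
    }
}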

While shared memory is still rather small, the maximum use should be made of it.

Eric
(I am not a compiler person)

I think there are several things going on here.

My understanding of __shared__ is that the lifetime is exactly and only equivalent to C ‘static linkage’, with the extra storage qualifier of ‘on-chip memory’. So I disagree that the ‘obvious’ way to manage _ZZ6share1vE1j and _ZZ6share2vE1j is to overlay them. If I declare three static variables in C (irrespective of names), I get three chunks of memory for the lifetime of the program (not the function or block). So this is not a bug; it’s doing ‘what it says on the tin’ (sorry if those ads don’t play outside the UK). Unfortunately, __shared__ has two effects, not one, and IMHO it’s the co-mingling of lifetime and storage qualifier that is messing things up.

A nicer partition of work would have been to use __shared__ as a storage qualifier or ‘placement statement’ (I think that’s old C++, or maybe from an old C++ object database), and to use ‘static’, global and auto storage classes in their traditional meanings (extern is correct if global exists).

As mentioned, the effect of having ‘overlaid’ use of the same chunk of shared memory (like FORTRAN EQUIVALENCE) can be achieved using the dynamic shared ‘trick’ (described in the programming guide), so NVIDIA can reasonably say ‘just do it yourself’. BUT the downside of DIY shared management is that the lifetimes of the competing uses of shared memory have to be manually managed by the developer, and in a complex scenario it would be much more productive (i.e. less error-prone, safer, easier to debug, …) for the compiler to do that. (EQUIVALENCE is deprecated in FORTRAN in this millennium :) )
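To spell out the DIY version (a sketch; the names, and the decision that phase 1 and phase 2 never overlap, are entirely down to the developer, which is exactly the error-prone bit):

extern __shared__ char smem[];   // sized by the third launch parameter

__global__ void kernel(void)
{
    int   *j1 = (int *)smem;     // phase 1 scratch
    float *j2 = (float *)smem;   // phase 2 scratch, EQUIVALENCEd onto the same bytes

    int i = threadIdx.x;
    j1[i] = i;                   // phase 1
    __syncthreads();             // every thread is finished with j1
    j2[i] = (float)i;            // phase 2 reuses the storage
}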

I believe Osiris is asking for something which is not the existing meaning of __shared__ (equivalent to ‘static’ linkage), but is like ‘auto’ + stored in on-chip memory. This seems like a great idea. It may be too late to redefine these meanings, but ‘logically’ they should be:

static __shared__ int fee;      // lifetime of the program, file scope, in shared on-chip memory
{ __shared__ int fie; }         // lifetime of the block (exactly like auto), in shared on-chip memory
{ static __shared__ int foe; }  // lifetime of the program, block scope, in shared on-chip memory
__shared__ int fum;             // lifetime of the program, global scope, in shared on-chip memory
extern __shared__ int fum;      // allocated globally elsewhere, but in shared on-chip memory

In the current model of ‘all code is inlined’, the opportunity for the compiler/ptx-er to manage this for the developer is there. In the absence of loops (and gotos), it’s the easy case.

I think __syncthreads() is not essential to having auto shared work correctly. I’m not convinced (yet) that I want __syncthreads() to be an essential or the only ‘marker’. I’d like block lifetime (i.e. {…}) to be the mechanism for telling the compiler/ptx-er where an auto comes to life or dies, and be left to manage the threads myself. Then I have more flexibility, and can have the auto shared interpretation even when there are divergent threads. If I can’t cope, I can always insert a __syncthreads() :D

If this is the direction NVIDIA go, then in the general case, where function calls are supported, it’s harder to manage auto shared, but it’s a similar problem to efficient static register allocation (just a different register file). So once that is solved by the smart folks at NVIDIA, I think they have a solution which applies to auto shared.

My $0.02.

GarryB.

Edit: this appears to retain the meaning of __shared__ for use in the other scenarios, such as declaring pointers like

int *__shared__ p;

or in function parameters. So good news all round ;)

I wouldn’t require the compiler (or a subsequent ptx/cubin-mangler) to figure out that __syncthreads() is protecting the overlaid use of shared variables.

Instead, I suggest the rule is as simple as using the static lexical structure of {…} blocks. Then, I think, the compiler can figure this out, as it (and the subsequent ptx/cubin-mangler) already solves static register allocation.

Ensuring parallel/divergent threads aren’t screwing up global device memory is already the developer’s problem, and this appears to be just the same but for shared on-chip memory (though I am always willing to be corrected). Ideally, it is solved by the same, or a similar, __syncthreads() mechanism.

GarryB

PS - I feel that __syncthreads() is a bit too big a hammer, and it may be useful to have more than one __syncthreads() barrier on each multiprocessor, but let’s skip that for now.

Thanks, Garry - I agree with you here. You will notice above that I never called this a bug; overlaying auto shared was just a suggestion. I know the tin says __shared__ has implied static storage, and that is what we have; there is just a better way to do it. I think the scope-based method is preferable, since then we can see what is happening - anything ptxas does we have no clue about!

My workaround was a dynamic smem allocator, but that came a cropper with shared memory pointer problems in the compiler. More to be said on that in its own topic.
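For what it is worth, the shape of the allocator was roughly this (a sketch; the names here are made up for illustration, and smem_top must be zeroed by thread 0 at kernel start):

extern __shared__ char smem_pool[];   // sized by the launch parameter

__shared__ unsigned smem_top;         // next free offset in the pool

__device__ void *smem_alloc(unsigned bytes)
{
    __shared__ unsigned off;
    __syncthreads();                  // all threads done with the previous top
    if (threadIdx.x == 0) {
        off = smem_top;
        smem_top = (off + bytes + 3) & ~3u;   // keep 4-byte alignment
    }
    __syncthreads();                  // every thread sees the same offset
    return smem_pool + off;
}

Freeing is just winding smem_top back, so it is strictly stack-like - and it is the pointers this returns that tripped the compiler up.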

Eric

PS - I also have a lot to say about __syncthreads(), including a much better implementation that is simple and backwards compatible with what we have. One does feel a bit ostracised here for speaking out, when I feel I have made a lot of positive contributions over the last couple of months. This is a lot of new technology and Nvidia had to cut some corners to get it out; time to go back and fix them once the market is rolling.

I agree. But note that (at least for now) the only way one could make the compiler understand the auto shared overlay is to use syncthreads, as it is the only way the compiler can assume anything about the runtime state of the threads. Without it, even explicit scoping will not help the compiler (we are still talking about osiris’ example code, right?), as threads are free to write to any (possibly data-dependent!) shared mem location.
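For example, with a barrier between the two calls the reuse becomes provable without interpreting any thread IDs (a sketch):

__global__ void kernel(void)
{
    share1();
    __syncthreads();   // no thread can still be touching share1's j here,
                       // so its storage could provably be reused by share2
    share2();
}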

Absolutely.

Peter

Both Garry and I are saying that syncthreads is not required - I say it is not related at all. The compiler does not currently worry about enforcing SIMD across warps (it could, and save us all from having to worry about syncthreads - another topic). Scope-based overlaying ({…}) is the simplest and easiest approach, if not the absolute ideal. For example, my main kernel (which is quite complex) has not one syncthreads, because it is organised to keep each warp independent. It would definitely benefit from overlaying of auto shared, especially where just a few registers have been moved out to shared to reduce the register count within a given block.

The current compiler implementation is based around a single-descent model with separate kernels invoked for each major function (all the samples and the libraries work this way). As soon as complexity goes up and you have multiple different descents, you need overlaying of auto storage if you don’t have a stack-based implementation. ptxas does register colouring to sort the registers out, but does not fix auto shared. Will we get a stack first? Who knows - there are call opcodes in the ptx opcode list in the compiler…

Eric

I was told above that it was done - fair to be disappointed when it is not.