pointer to shared memory compiler problems

The cuda programming guide has some details about using a pointer to shared memory…

"Pointers in code that is executed on the device are supported as long as the compileis able to resolve whether they point to either the shared memory space or the

global memory space, otherwise they are restricted to only point to memory allocated or declared in the global memory space."

I have found that this is VERY touchy in the current compiler (Windows XP, v0.8.1). For instance, I am trying to load shared memory in 16 byte global memory chunks with the following code.

__global__ void kernel(uint4 *puCoeff)


    __shared__ int16 aiCoef[256];  

... Bunch of initialization of variables...

   // Load in one 16 byte line in one instruction

   // The following line is to work around a compiler issue 

    // where the pointer is incorrectly being interpreted as global memory

    //  aiCoef[uOffset] = aiCoef[uOffset];  <----------

    *((uint4 *)&aiCoef[uOffset]) = puCoeff[uBlockBaseAddr + uThreadID];



this compiles to:


	ld.global.v4.u32  {$r20,$r21,$r22,$r23}, [$r15+0];	#  

	st.global.v4.u32  [$r19+0], {$r20,$r21,$r22,$r23};	#  


But, if I do the funny assignment (aiCoef[uOffset] = aiCoef[uOffset]), then I get:


	ld.global.v4.u32  {$r21,$r22,$r23,$r24}, [$r19+0];	#  

	st.shared.u32  [$r20+0], $r21;	#  id:1139

	st.shared.u32  [$r20+4], $r22;	#  id:1139

	st.shared.u32  [$r20+8], $r23;	#  id:1139

	st.shared.u32  [$r20+12], $r24;	#  id:1139


Which is what I want.

So, a couple questions/suggestions.

  1. Can NVIDIA publish the rules the compiler is using about using shared memory pointers? Is there a better way around this problem?

  2. There should be a way to hint to the compiler that a given pointer is in shared memory (i.e. (shared*) cast)

  3. Why does it take 4 instructions to move the 16 bytes into shared memory? Is there an even faster way that moves global to shared in one instruction?

  4. Is there a way to add inline assembly yet?

Forgive me if this thread already exists on the board, but I couldn’t find anything that answered the questions. It seems like this would be a common problem for folks?


Is it possible the pointer manipulation is the problem? I am having some trouble with that right now myself. My problem is at runtime, but then again I am not using vector types - it’s worth a shot. Try rewriting that transaction a different way to avoid the pointer cast. If puCoeff is a uint4 also I see nothing wrong with the intent.

Also, when you call the kernel, do you include the last of the three arguments inside the <<<>>> which tell how much shared memory will be used?

That’s what comes to mind for me, though I am not experienced myself. I have a question for you actually: how did you get a disassembly, and where is documentation for that language?


Note the question a few down from yours on the forum: [url=“http://forums.nvidia.com/index.php?showtopic=35269”]http://forums.nvidia.com/index.php?showtopic=35269[/url] about using vectors in CUDA. It seems the problem might be that CUDA does not support vector assignments.

I too am interested in this topic (am using shared mem pointers and have not been bitten yet). So far the compiler seems to be able to keep track.

Eli, you get assembler using nvcc -ptx (.ptx is the assembler file) note you need a gobal function in your file otherwise the compiler optimises out all your code.

I asked about the assember manual (post on instruction timings) and am still awaiting an answer. I am not used to having to program an architecture where the implementation is hidden! (that was the second request, this is the third)

On inline assembly, I asked also but no reply, I think that it is not very useful as everything has to go through registers (their RISC) and you have no way to allocate them and no way to know which register the thing you are interested in is in (and there are a lot of registers).

Jesser, there are instructions to load 64 and 128 bit globals but they only go to registers, like everything else. These instructions make device mem coalescing easier. If we had a manual then all questions would be answered!

Just to re-iterate a clarification on shared mem pointer management would be appreciated.

Thanks, Eric
(please correct me if I am wrong - note Nvidia is happy to leave incorrect posts uncorrected)

Spoke too soon:

extern __shared__ float* shared[];

__device__ int


    int         a,

    int         len)


    float**     a1;

    float**     a2;

    int         i;

   i = len;

    for (a1 = shared + len; a1-- != shared; --i)


        float*  b1;

       a2 = a1;

        if (fabsf(a1[0][i]) < 5.0e-6f)


            b1 = *a1;

            *a1 = *a2;

            *a2 = b1;



    return i;


__global__ void kernel()


    test(0, 16);



#  29  __global__ void kernel()


        mov.u32         $r1, (&shared);         # ****** if I get one of these @ the start I am in trouble

 #      .loc    10      14      0

        add.u32         $r2, $r1, 60;           #

        mov.u16         $rh1, 64;               #

        mov.u32         $r3, 64;                #

        add.u32         $r4, $r1, -4;           #


 #<loop> Loop body line 14, nesting depth: 1, estimated iterations: 100

 #      .loc    10      19      0

        ld.global.u32   $r5, [$r2+0];   #  id:35 shared+0x0 ****** this should be ld.shared.u32

        add.u32         $r6, $r3, $r5;          #

        ld.global.f32   $f1, [$r6+0];   #  id:36

        abs.f32         $f2, $f1;               #

fixed in next release?

On the general issue of typing for explicit shared pointer, perhaps we could use a volatile modifier, allow us to keep using an unmodified c++ font end (I have never looked inside gcc). Device and shared memory are already by spec volatile so it is redundant. Shame shared memory did not end up in the same address space as device memory (then the hardware looks after this issue).

So how about some answers from Nvidia - I think I am contributing here.


I’m not sure what the semantics being asked for are, and what the proposal is. I believe the requirement is for a way to tell the compiler that an lvalue contains a pointer to shared memory, and have it generate code correctly.

Does the suggestion to ‘use a volatile modifier’ means some new type qualifier which can stand in contexts where ‘volatile’ makes sense? Then I could live with that. But, I would suggest using ‘shared’. Definitions would look like:

__shared__ int a[100];

int *__shared__ p = &a[n];



void func(int *__shared__ q) ...

Or does the suggestion ‘use a volatile modifier’ literally means use ‘volatile’ to be this type qualifier (and not just the general idea of type qualifiers)? Then may I vote NO.

I have two reasons:

  1. The semantics of volatile are relatively well understood, and should not be overloaded with new semantics. I currently believe the semantics of what is required are not those of ‘volatile’. I don’t mind at all if, in general, the compiler keeps values read from shared memory in registers and write them back to shared when it wants to.

  2. It makes some sense to have both volatile and shared qualfiers. For example:

volatile int* p0;               // don't cache p0 in a register, do writes and reads to memory

__shared__ int *p1;             // p1 is in shared memory, no clue where it points

__shared__ volatile int* p2;    // p2 is in shared memory, and don't cache p2 in a register

volatile int *__shared__ p3;    // don't cache p3 in a register, and it points to shared memory

__shared int *__shared__p4;     // p4 is in shared memory, and points to an int in shared memory

and expect different symantics in each case, even if the compiler doesn’t handle it right now.

Using ‘volatile’ to mean the object referred to is in shared memory, seems like a bad corner to remove prematurely; one of the useful permutations may even be lost.

I think that is a nice-to-have outcome, but shared already appears in type qualifier like places, so I wouldn’t want to constrain the solution.

I didn’t spot that, would you please give me a pointer to where nVidia has said that?

My $0.02 worth,


Edit: PS: I’m okay with attribute((shared)) in the places where I’ve used shared as a general type qualifier, I think that may work anyway. If it helps the parser, nVidia could use undecorated shared as a ‘storage qualifier’ (a new concept AFAIK).


Just to let you know: In my tests I have seen that “volatile” already has the predicted effect in the CUDA device functions. I found a situation where setting a shared variable (flag) in one thread had no effect in the other threads of the same warp unless I either added an extra __syncthreads() call or declared the shared variable volatile.

The code with no volatile modifier resulted in a ptx assembly where an unupdated register value was used instead of the shared variable. Adding the volatile modifier made the compiler include the needed ld.shared.u32 instruction to the ptx file and the code started working correctly. As I told, the additional __syncthreads() call had the same effect but the bar.wait instruction would be excess for synchronising just one warp.


/Pyry - Thank you for that feedback, that’s very helpful to know, and saves me some effort.

Thanks guys, I meant type modifier. I know it is a hack. You need to be able to have different indirections within a multilevel reference able to be specified as global or shared. My understanding is that Pyry’s use is invalid in the general sense since you cannot predict the dispatch order of warps within a block so even getting the compiler to write the variable is no guarantee that it will be set for others (unless you are running just 1 warp, and warp size is implementation dependent). You have to use a sync. That is why I said volatile is redundant for shared and global memory. It does have this use for 1 warp but in the G80 you can only get 33% occupancy this way.

My bug is that when the compiler decides to suboptimise a shared mem pointer into a register it forgets that is a shared mem pointer.


ed: a sync in a 1 warp block is a noop

Yes you can run multiple independent warps this way. The trick is to have a shared array of as many or more elements as there are warps and then let eg. blockDim.x be equal to warp size and blockDim.y define the number of warps for the CUDA function. Then each warp can have its own flag:

__shared__ volatile uint flag[8]; // 8 to be the max # of warps per block

do {

    if (threadIdx.x == 0) flag[threadIdx.y] = 0; // threadIdx.y is the warp index

    // ...

    if (continue_condition()) flag[threadIdx.y] = 1; // drawback: possible bank conflict

} while (flag[threadIdx.y]);

// etc.

What I pointed out was that this kind of a loop can become infinite if the volatile keyword is left out.

It’s true that the warp size is implementation dependent. In my opinion there should be a CUDA function something like cudaGetDeviceWarpSize to find out this kind of a limit at runtime.


Edit: PS. I like GarryB’s idea of a modifier-like shared keyword. [Could the underscores be even left away in the syntax?] It would work just like const and volatile in C++. Correct me, but shared cache is by definition non-volatile because using a value in the shared cache still needs one ld.shared.* or st.shared.* instruction that could in some cases be avoided when calculating with the same shared variable in a register.

Neat! Yes there is a reason to use volatile. Take that one back. Not sure if you will get a conflict on that write as my last query on that was never answered. Preliminary measurements indicate that perhaps not.

Thanks, Eric.

and cudaGetClockSpeed would be useful too.

ed: if we can have other modifiers then great! Should work the same as const/volatile. Yes you are right that the compiler does bring shared and globals into registers for operations. I could not think of a reason you could want/expect updates without a sync .

Thanks for the suggestion, but in my case I can seem to put enough volatiles in the right place to make the compiler happy, I get an error like this:

__shared__ volatile int16 aiCoef[];

uint4 volatile *puCoeff

1>"c:\bluefish\src\decoder\mpeg2video\mpeg2kernel.cu", line 136: error: no

1>          operator "=" matches these operands

1>            operand types are: volatile uint4 = volatile uint4

1>      *((uint4 volatile *)&aiCoef[...]) = puCoeff[...];

On the topic of adding a new “type”, I vote for a new explicit way to tell the compiler that I really am talking about a shared memory pointer. This is especially handy in cases where you’ve casted away the original type as in my example. In most cases the compiler should figure it out (even with optimizations turned on), but it would be nice to be able to force it in the cases that there is no way the compiler would know (e.g. a function call with a shared memory pointer as a parameter)

Just for grins, what happens if you do declare a pointer to be shared? e.g.

__device__ void func(unit4 *__shared__ p) ...

I did a very quick check with uchar *shared p; on Friday (but my batting average is low recently), and I think the compiler was fine with it, but my code was so simple the compiler wasn’t confused anyway.


I tried all combos I could think of and did not get anything accepted. As it turns out I think the compiler is doing the right thing at the front end for me. I came here because I needed a dynamic allocator for shared memory. All the simple tests worked but incorporating into a larger app failed. I am casting shared pointers from char* to anything and they don’t lose their sharedness immediately.

I stumbled on an example where just changing the value of the memory allocator pointer from a known constant to an unknown changed the way the code was optimised and broke it. So I posted an official bug report on that example. Don’t know if anything will get into the almost there release though.


If I’ve got this right (?), to be consistent with C associativity (like const), the syntax:

X TYPE * Y ptr;


  • the variable ptr is stored in Y memory
  • ptr is a pointer to a TYPE
  • TYPE is stored in X memory.

So “device float * shared pfloat;” refers to a variable pfloat, stored in shared memory, that points to a float in device memory.

(Currently, the compiler generates a warning on such syntax, and ignores the second declspec.)

It seems necessary to be able to explicitly tell the compiler where the variable resides, and where it points. Is this something planned for future releases?

I believe your interpretation of X and Y is correct.

To be somewhat pedantic: in general the X and Y can be about properties or specifications other than memory; const restricts the kind of operations that a compiler will generate, and volatile changes the way the compiler caches a value.

I agree that, right now, it looks very helpful to be able to tell the compiler about both the pointer and the target address of the pointer. Also it looks unlikely that more general compilation, link and load models will make ths go away (right now, everything is inline, which should be a very simple model).

Purely based on error messages I was able to generate from the Linux-based compiler, shared looks like it gets turned into an attribute(). Assuming the compiler propagates these properly through it’s internal (tree) representations, it is plausible that this could be made to work.

I haven’t noticed any comments from nVidia about this functionality, though.

Yes, I was going to say that everyone is saying the same thing here! I can see an alternate point of view that appears to be Nvidia’s tack - that the shared attribute is carried across assignments, a bit more like an interpreter, not the way C is designed to work where types have to match and there are rules for autocasting. Perhaps this was easier to implement, though it is not correct yet. Can get it right all the time except when both types of pointer get assigned to the same variable at different places in the program. My problem was with a “floatshared” type that lost its “shared” after a while. It could be that this issue has been correctly fixed in the G92 and the G80 is going to be a loner with a separate address space for shared memory (there is one address space for everything else). Guessing as usual here… some guidance would be helpful!
PS Nvidia have looked at my bug and it is in progress at present.

Just a quick note to say it looks like all my shared mem pointer problems were fixed in 1.0. Still leaves the situation where the compiler cannot possibly know what sort of pointer it is - if I push a shared pointer onto my register save stack (I don’t trust ptxas to spill the right registers and there is no way of knowing what it has done) then I can’t get it back properly. Not a major problem. There has been no clarification from Nvidia on these issues raised above for quite some time now.


[quote name=‘jesser’ date=‘May 8 2007, 07:51 PM’]

__global__ void kernel(uint4 *puCoeff)


    __shared__ int16 aiCoef[256];  

... Bunch of initialization of variables...

   // Load in one 16 byte line in one instruction

   // The following line is to work around a compiler issue 

    // where the pointer is incorrectly being interpreted as global memory

    //  aiCoef[uOffset] = aiCoef[uOffset];  <----------

    *((uint4 *)&aiCoef[uOffset]) = puCoeff[uBlockBaseAddr + uThreadID];



I’m having trouble reproducing this problem because the code above has syntax errors. First, what is “int16”? Second, unless int16 is a typedef of uint4, there’s no conversion between puCoeff and aiCoef. Third, the commented out line reads from shared memory into shared memory (at the same location!), so it’s basically a noop unless uOffset is the same for multiple threads (in which case it’s probably a bug in your code anyway).

In short, I can’t tell what’s wrong based on the code above, and thus I can’t really follow the rest of this thread.

Can someone provide an example that I can use to verify that the compiler does the wrong thing?



Resurrecting a dead thread, but I’m having this problem with the 2.0sdk:

extern shared float sh_base;
float *ptr = &sh_base[offset];

for (int i = 0; i < 10; i++)
ptr[i] = …

When compiled with -G, I get:
“Advisory: Cannot tell what pointer points to, assuming global memory space”

Which, of course, is not right. :-)

There’s still something really wrong here with shared pointers…

__shared__ float block[2][1024];

float *src = block[0], *dst = block[1], *tmp;

for (...) {

  tmp = src; src = dst; dst = tmp; /* Swap src/dst arrays */


This causes the GPU to crash on any write to dst AFTER a swap occurs.

__shared__ float block[2][1024];

float *src = block[0], *dst = block[1], *tmp;

for (int i = 0; i < 10; i++) {

  src = block[(~i) & 1];

  dst = block[i & 1];


A crash does not occur in this case, but the results aren’t right.

In both case A and B, running under the emulator produces the result I’d expect.

Are there any other suggestions that might work under SDK/Toolkit 2.0?