Help me understand shared memory allocation

Hi, I am trying to speed up some scientific computation in Mathematica with CUDALink.

Mathematica's CUDALink just passes the kernel source along to Visual Studio for compilation.

My problem is with memory allocation. I tried some examples from NVIDIA's website, for instance the prefix sum calculation code.

It fails to run as written, with what looks like a segfault caused by this declaration:

extern __shared__ float temp;

So I guess I have to allocate the shared memory when I call the kernel.

In Mathematica I do this by passing in a parameter along with the run (I haven't found how to allocate shared memory directly at the kernel call in CUDALink), so that I have

extern __shared__ float temp[BLOCK_DIM];

but now I get the compiler error

error: __local__ and __shared__ variables cannot have external linkage

Once I remove extern, things seem to work, but I get wrong results from the prefix sum code.

The code was taken from here

I also had this same problem with another example I found online. Could someone help me understand this? If I am able to allocate the memory at the kernel call, can I then just declare it as extern, or was this example written for a different version of CUDA?

Have you tried reading the CUDA Programming Guide 4.0?
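
For what it's worth, there are two ways to declare shared memory in CUDA C, and the declarations in the question mix them. With the dynamic form you write an unsized extern array and give its size in bytes as the third argument of the launch configuration. With the static form you drop extern and use a compile-time constant as the size. Here is a minimal sketch showing both side by side in plain CUDA C (not CUDALink). Note that it uses a naive single-block inclusive scan that I wrote for illustration, not the prefix sum code from the NVIDIA example, and that BLOCK_DIM = 8, the kernel names and run_scan are made-up names for this sketch only.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK_DIM 8  // hypothetical block size, chosen only for this sketch

// Naive inclusive scan over one block; temp must hold blockDim.x floats.
__device__ void inclusive_scan(float *temp, const float *in, float *out, int n)
{
    int tid = threadIdx.x;
    temp[tid] = (tid < n) ? in[tid] : 0.0f;
    __syncthreads();
    for (int offset = 1; offset < (int)blockDim.x; offset *= 2) {
        // Read the neighbour first, then synchronise before overwriting.
        float addend = (tid >= offset) ? temp[tid - offset] : 0.0f;
        __syncthreads();
        temp[tid] += addend;
        __syncthreads();
    }
    if (tid < n) out[tid] = temp[tid];
}

// Dynamic form: unsized extern array, byte count supplied at launch.
__global__ void scan_dynamic(const float *in, float *out, int n)
{
    extern __shared__ float temp[];
    inclusive_scan(temp, in, out, n);
}

// Static form: no extern, size fixed at compile time.
__global__ void scan_static(const float *in, float *out, int n)
{
    __shared__ float temp[BLOCK_DIM];
    inclusive_scan(temp, in, out, n);
}

// Runs one of the two kernels on n <= BLOCK_DIM values; true on success.
bool run_scan(bool use_dynamic, const float *h_in, float *h_out, int n)
{
    float *d_in = NULL, *d_out = NULL;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc((void **)&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void **)&d_out, bytes) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    if (use_dynamic) {
        // Third launch argument = bytes of dynamic shared memory per block.
        scan_dynamic<<<1, BLOCK_DIM, BLOCK_DIM * sizeof(float)>>>(d_in, d_out, n);
    } else {
        scan_static<<<1, BLOCK_DIM>>>(d_in, d_out, n);
    }
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess;
}

int main()
{
    float in[BLOCK_DIM] = {1, 2, 3, 4, 5, 6, 7, 8};
    float out[BLOCK_DIM];
    for (int pass = 0; pass < 2; ++pass) {
        bool dynamic = (pass == 0);
        if (!run_scan(dynamic, in, out, BLOCK_DIM)) {
            printf("CUDA call failed\n");
            return 1;
        }
        printf("%s:", dynamic ? "dynamic" : "static");
        for (int i = 0; i < BLOCK_DIM; ++i) printf(" %g", out[i]);
        printf("\n");
    }
    return 0;
}
```

Both kernels should give the same result. As for the wrong results after removing extern, one thing worth checking (this is only a guess, since I can't see the code you used): some scan kernels process more elements per block than there are threads, so a static array sized to the thread count may be too small for what the kernel indexes.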