2D shared float array configured externally

I want to use a shared float array of size [64][64] (16 KB), but since it’s not possible to use all of the shared memory, I can’t do that.

Someone told me that it’s possible to use all of the shared memory if I declare it “extern” and set the size at run time. However, I still want it to be a 2D array. How do I do that?

If I try

extern __shared__ float shared[][];

I get

error: an array may not have elements of this type


extern __shared__ float shared[];

works fine, but then I have to do all the 2D indexing myself instead of writing shared[a][b]. Is there an easy solution to this problem?

The problem is that the first 64 bytes of shared memory are reserved for kernel parameters and configuration values. It has nothing to do with extern or not. The trick you are probably thinking of requires pointer trickery with negative indices to access these extra 16 words and overwrite them. This is not guaranteed to work in general (you have to make sure the generated PTX doesn’t access those memory locations thinking they contain the usual parameters after you obliterate them) and on Fermi this trick will immediately segfault the kernel due to improved memory protection.

The immediate problem here is that the compiler cannot do the 2D index calculation for you unless it knows the row size, and an array declared without static dimensions doesn’t give it one. Most people solve this with a #define macro that does the index calculation without cluttering up the code.
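A minimal sketch of the macro approach, assuming a 64x64 tile and a device that allows 16 KB of dynamic shared memory per block (Fermi or later). The names TILE, SH, transposeTile and runTransposeCheck are made up for illustration; they are not from the posts above.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical names, chosen for this sketch only.
#define TILE 64
// Row-major 2D index into the flat dynamic shared array.
#define SH(row, col) smem[(row) * TILE + (col)]

// Transposes one TILE x TILE matrix through shared memory.
// Launch with TILE threads; each thread handles one column.
__global__ void transposeTile(const float *in, float *out)
{
    // Size comes from the third <<<...>>> launch argument.
    extern __shared__ float smem[];

    for (int r = 0; r < TILE; ++r)
        SH(r, threadIdx.x) = in[r * TILE + threadIdx.x];
    __syncthreads();

    for (int r = 0; r < TILE; ++r)
        out[r * TILE + threadIdx.x] = SH(threadIdx.x, r);
}

// Host-side check: returns true if out is the transpose of in.
bool runTransposeCheck()
{
    const int n = TILE * TILE;
    const size_t bytes = n * sizeof(float);
    std::vector<float> hIn(n), hOut(n, 0.0f);
    for (int i = 0; i < n; ++i) hIn[i] = static_cast<float>(i);

    float *dIn = nullptr, *dOut = nullptr;
    if (cudaMalloc(&dIn, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) { cudaFree(dIn); return false; }

    bool ok = cudaMemcpy(dIn, hIn.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    // Third launch argument: dynamic shared memory size in bytes (16 KB here).
    transposeTile<<<1, TILE, bytes>>>(dIn, dOut);
    ok = ok && cudaGetLastError() == cudaSuccess;
    ok = ok && cudaDeviceSynchronize() == cudaSuccess;
    ok = ok && cudaMemcpy(hOut.data(), dOut, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(dIn);
    cudaFree(dOut);

    for (int r = 0; r < TILE && ok; ++r)
        for (int c = 0; c < TILE; ++c)
            if (hOut[r * TILE + c] != hIn[c * TILE + r]) { ok = false; break; }
    return ok;
}
```

Calling runTransposeCheck() from main should return true on such a device. When the row width is a compile-time constant, as it is here, casting the flat array to a pointer of type float (*)[TILE] should also give tile[a][b] syntax without a macro.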

OK, thanks. Is it true that it’s possible to use all of the shared memory on Fermi?

No idea (compile with -arch sm_20 --ptxas-options=-v to find out), but Fermi defaults to 48kB of shared memory, so the array you mentioned above will fit.

Yes I know :) Then I can run 3 blocks on each multiprocessor and thereby the code will be three times as fast :D

The usual way I do this is:

const unsigned int bidx=blockIdx.x|cdata[0];

const unsigned int bidy=blockIdx.y|cdata[0];

and so on (where cdata[0] is in constant memory and equals 0).

The compiler can’t figure out at compile time that these ORs are actually no-ops, and will (usually!) keep the values in registers.
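A small sketch of that OR-with-zero pattern in a complete kernel. The one-element cdata array mirrors the snippet above; the kernel recordBlockIds and the helper runBlockIdCheck are hypothetical names added for illustration. Whether the values actually stay in registers depends on the compiler and architecture, so this only shows that the results are unchanged.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Constant-memory word that the host sets to 0 before launch.
__constant__ unsigned int cdata[1];

// Each block writes its own linear index, computed from the OR-ed copies.
__global__ void recordBlockIds(unsigned int *out)
{
    // OR with a run-time zero: the value is unchanged, but the compiler
    // cannot fold the operation away at compile time.
    const unsigned int bidx = blockIdx.x | cdata[0];
    const unsigned int bidy = blockIdx.y | cdata[0];

    if (threadIdx.x == 0)
        out[bidy * gridDim.x + bidx] = bidy * gridDim.x + bidx;
}

// Host-side check: returns true if every block recorded its own index.
bool runBlockIdCheck()
{
    const unsigned int zero = 0;
    if (cudaMemcpyToSymbol(cdata, &zero, sizeof(zero)) != cudaSuccess) return false;

    const dim3 grid(4, 3);
    const unsigned int n = grid.x * grid.y;
    const size_t bytes = n * sizeof(unsigned int);

    unsigned int *dOut = nullptr;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) return false;

    recordBlockIds<<<grid, 32>>>(dOut);
    bool ok = cudaGetLastError() == cudaSuccess;
    ok = ok && cudaDeviceSynchronize() == cudaSuccess;

    std::vector<unsigned int> hOut(n, 0xFFFFFFFFu);
    ok = ok && cudaMemcpy(hOut.data(), dOut, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dOut);

    for (unsigned int i = 0; i < n && ok; ++i)
        if (hOut[i] != i) ok = false;
    return ok;
}
```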