Constant memory allocation and initialization

The SDK and Programming Guide are pretty sketchy on the topic of allocating and initializing constant memory.

Though several posts provide hints here and there, a single reference point would be very helpful!

Specifically, I’m unclear on how to dynamically allocate constant memory. Would this be similar to dynamically allocated shared memory – i.e., using a single base array and offsetting into this array?

Many thanks!

You can’t dynamically allocate constant memory. What’s your use case?

Note that unlike shared memory, the total amount of constant memory used by a kernel doesn’t affect execution performance.

When I say “dynamically allocate” constant memory, I mean in the sense analogous to what is described in the Programming Guide 0.8 (pp. 19-20) for shared memory:

I have several arrays whose number and length may vary from kernel invocation to kernel invocation. (Specifically, each array contains a set of discount factors or dividends for multiple times in the future.) These naturally fit into constant memory because:

a) they are constant for the life of the kernel invocation;

b) they are shared across blocks;

c) they will benefit from caching, since a particular thread block will use only one set of arrays;

d) they are too big in general to fit into the shared memory.

My question is, how do I declare these arrays? Since their size is variable at kernel invocation, I can’t declare them statically.

I was trying something like this, analogous to shared memory “allocation”:

extern __constant__ float constMemPool[];

// inside the kernel...
__constant__ float *array1 = &constMemPool[0];
__constant__ float *array2 = &constMemPool[array1Len];
__constant__ float *array3 = &constMemPool[array1Len + array2Len];
...

where array1Len and array2Len are parameters passed to the kernel. (Similar things for 2D arrays.)
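The host side of this scheme — building the single linear buffer that later gets copied to the constant pool in one shot — is ordinary C. A minimal sketch (the function `pack_arrays` and all names here are hypothetical, not from the SDK): pack the variable-length arrays back to back and record each one’s element offset, which is exactly the offset the kernel would use.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical host-side helper: pack several variable-length float
   arrays into one linear pool, recording each array's element offset.
   The filled pool would then be copied to the __constant__ region
   with a single cudaMemcpyToSymbol call. */
static size_t pack_arrays(float *pool,
                          const float *const arrays[],
                          const size_t lengths[],
                          size_t offsets[],
                          size_t nArrays)
{
    size_t used = 0;
    for (size_t i = 0; i < nArrays; ++i) {
        offsets[i] = used;  /* element offset of array i within the pool */
        memcpy(pool + used, arrays[i], lengths[i] * sizeof(float));
        used += lengths[i];
    }
    return used;  /* total number of elements packed */
}
```

The returned total (times `sizeof(float)`) is the byte count to pass to the copy, and the offsets are what you would hand to the kernel as parameters.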

Am I completely off track here?

The other question is how to initialize the constant memory before the kernel call? There are hints about cudaMemcpyToSymbol(), for instance:

Topic 28780

But I’m still unclear on exactly how to (correctly!) initialize constant memory. (The Programming Guide doesn’t offer much help, either.)

Do you mean, e.g., running out of shared memory? The docs say the GeForce 8800 series has 64KB of constant memory, with an 8KB cache per multiprocessor. My constant data can fit in 64KB, and the cache will hold what is necessary for each thread block. “For all threads of a warp, reading from the constant cache is as fast as reading from a register as long as all threads read the same address.” (Programming Guide 0.8, p. 58) So I think constant memory is the way to go in my case.
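That quoted broadcast rule can be made concrete with a small kernel (names hypothetical — just a sketch of the two access patterns, which differ only in the index used):

```cuda
__constant__ float coeffs[256];

__global__ void broadcastVsSerialized(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Fast: every thread in the warp reads the SAME constant address,
    // so the value is broadcast from the constant cache.
    float a = coeffs[blockIdx.x];
    // Slower: each thread reads a DIFFERENT constant address,
    // so the accesses are serialized.
    float b = coeffs[threadIdx.x];
    out[i] = a + b;
}
```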

Looking at the .ptx files, it seems to me that constants are embedded into the executable at compile time (that’s why you have to copy bytes to the addresses specified by symbols to set constant values at run time). So you wouldn’t be able to allocate constant arrays of different sizes for different invocations of the same compiled kernel.

I think you should be able to use texture memory for your purposes. Texture sizes can be set dynamically and they are cached. Just make sure to sample in the exact texel center to avoid filtering (unless filtering helps you).

Paulius
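For completeness, here is a sketch of the 1D-texture route paulius suggests, using the texture reference API of this era (all names illustrative, and error checking omitted). Note that tex1Dfetch on linear memory uses integer addressing with no filtering, which sidesteps the texel-center concern entirely:

```cuda
texture<float, 1, cudaReadModeElementType> texArray;

__global__ void readFromTexture(int n, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(texArray, i);  // unfiltered integer fetch
}

void launch(const float *h_data, int n, float *d_out)
{
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);
    // The size is chosen at run time -- this is the "dynamic" part.
    cudaBindTexture(0, texArray, d_data, n * sizeof(float));
    readFromTexture<<<(n + 255) / 256, 256>>>(n, d_out);
}
```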

Cyril and paulius, thanks for the pointers.

I’ll rejigger this using texture memory, and keep digging into the constant-memory issue. (Can’t test on card until back in the office Monday.)

Still, a “How to Use Constant Memory” primer from the nVidia experts would be much appreciated, and prevent a lot of hapless guesswork. :thumbup:

Actually, if you know the maximum size of the array you’d like to put in constant memory, you could just statically allocate that and copy only the desired number of bytes at run-time. I haven’t tried constant arrays, but they ought to work fine.

Here’s how I’ve used the __constant__ qualifier for scalar variables:

__constant__ int constantN;

__global__ void function(...)
{
    int x = constantN*constantN;
    ....
}

void hostFunction()
{
    int dim = 512;
    cudaMemcpyToSymbol(constantN, &dim, sizeof(dim));
    ....
}

Perhaps there’s a better way. I can’t say that this is any better than just passing an argument to the global function (passing arguments may actually lead to more readable code), since the constant must be loaded into a register to be useful. However, it makes good sense if you have arrays of constants. Just make sure that all your constant data fits (Section 5.1 of the programming guide).

Paulius
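The array variant of that suggestion might look like the following sketch (MAX_ELEMS and the names are made up — the point is just to allocate the maximum statically and copy only what a given invocation needs):

```cuda
#define MAX_ELEMS 4096  // chosen so the array fits in the 64KB total

__constant__ float constArray[MAX_ELEMS];

__global__ void useConstArray(unsigned int n, float *out)
{
    unsigned int i = threadIdx.x;
    if (i < n)
        out[i] = 2.0f * constArray[i];
}

void launch(const float *h_data, unsigned int n, float *d_out)
{
    // Copy only the n elements actually used by this invocation.
    cudaMemcpyToSymbol(constArray, h_data, n * sizeof(float));
    useConstArray<<<1, 256>>>(n, d_out);
}
```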

Yes, that’s good. I, too, prefer parameters.

Here’s what I’ve done – works under the emulator so far:

__constant__ float constMemPool[16384];  // ... or some other big enough number

__global__ void kernel(unsigned int nElem0, unsigned int nElem1, unsigned int nElem2, float *g_Out)
{
    __constant__ float *array0 = constMemPool + 0;
    __constant__ float *array1 = array0 + nElem0;
    __constant__ float *array2 = array1 + nElem1;

    // do some things here that reference these arrays,
    // for example:
    float x = array1[threadIdx.x];
    ...
}

void host_function()
{
    ...
    // h_arrayIn contains the elements of all input arrays "linearized"
    // so they may be copied into constMemPool
    cudaMemcpyToSymbol(constMemPool, h_arrayIn, cbIn, 0);
    kernel<<<grid, threads, cbSharedMem>>>(n0, n1, n2, d_vfOut);
    ...
}

All this seems to work… (famous last words)

… but, I’m guessing the constant memory allocated to constMemPool is blocked from all other uses until the application terminates.

That’s not a problem if this is the only application using the GPU, since I can reuse constMemPool for other kernel calls.

However, if multiple applications are competing for the GPU’s resources, could the GPU easily run out of constant memory (e.g., the 64KB limit on the GeForce 8800 series)?

The above compiles with the -deviceemu flag.

However, when I try to compile without this flag, I get the error “expression must be a modifiable lvalue” on the lines:

__constant__ float *array1 = array0 + nElem0;
__constant__ float *array2 = array1 + nElem1;

So I guess this won’t work quite as written… the variables array1 and array2 have to be replaced with macros that access constMemPool directly:

#define ARRAY0(i) constMemPool[i]

#define ARRAY1(i) constMemPool[nElem0 + i]

#define ARRAY2(i) constMemPool[nElem0 + nElem1 + i]

which compiles under -deviceemu or not.

I believe only host functions can set values in constant memory. Try removing constant qualifiers from array1 and array2 declarations. Alternatively, you could keep those pointers in constant memory and set their values from the host function that invokes the kernel.

Paulius
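Applying that suggestion to the earlier kernel, the derived pointers become ordinary automatic variables — a sketch, untested on hardware:

```cuda
__constant__ float constMemPool[16384];

__global__ void kernel(unsigned int nElem0, unsigned int nElem1,
                       unsigned int nElem2, float *g_Out)
{
    // Plain pointers into the constant pool -- no __constant__
    // qualifier on the local declarations.
    const float *array0 = constMemPool;
    const float *array1 = array0 + nElem0;
    const float *array2 = array1 + nElem1;

    g_Out[threadIdx.x] = array1[threadIdx.x];
}
```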

How about using 1D textures?
They can be “dynamically” allocated, and serve as constant arrays during kernel execution…

I also use constant memory quite extensively. I was a bit puzzled by the following issue:

According to NVIDIA there is 64K of constant memory in total (+ 8K cache per SM).

Besides that, ptxas sometimes also places frequently used constants into constant memory.

According to decuda, these correspond to different memory regions; compare:

sub.half.b32 $r17, $r6, $r3

add.half.b32 $r18, $r17, c0[0x0000] <-- "user-defined" constant mem

add.u32 $r6, $r6, $r3

min.u32 $r3, $r17, $r18

sub.u32 $r17, $r6, c0[0x0000]

mul24.hi.u32 $r18, $r1, $r3

min.u32 $r6, $r6, $r17

cvt.rz.f32.u32 $r17, $r18

mul.rz.f32 $r17, $r17, c0[0x0004]

shl.u32 $r18, $r18, 0x00000007

cvt.rzi.u32.f32 $r17, $r17

mad24.lo.u32 $r18, -$r12, $r17, $r18

mul24.lo.u32 $r3, $r1, $r3

and.b32 $r3, $r3, c1[0x0004]   <--- compiler generated constant mem

...

Does this essentially mean that the c0 and c1 memory spaces overlap (so that in fact one cannot use the whole 64K space), or that the amount of constant memory is not limited to 64K?

I am also using constant memory as a dynamic memory pool in this manner. However, I am having a different problem.
The issue is that any file which wishes to access the constant memory must know the name of the constant array.
However, I can’t get extern to work like it should - namely, to predeclare a constant array in a header, while the actual constant array exists in only one file.

The result is that the program compiles fine, but I get the runtime error:

“duplicate global variable looked up by string name”

It appears that every .cu file that includes the header is in essence creating its own copy of the constant array (or something like that.)

I try putting a declaration in the header like this (included by several different files):

extern __constant__ int jCudaConstArrayInt[J_MAX_CONST_MEM_INT/sizeof(int)];

Then, in one file (only) I have the actual arrays as so:

__constant__ int jCudaConstArrayInt[J_MAX_CONST_MEM_INT/sizeof(int)];

But as soon as more than one file includes the header, I get the duplicate symbol error.
Any ideas how I can make this work such that more than one module can access the same constant array?

Thanks,

  • Jeff
