The SDK and Programming Guide are pretty sketchy on the topic of allocating and initializing constant memory.
Though several posts provide hints here and there, a single reference point would be very helpful!
Specifically, I’m unclear on how to dynamically allocate constant memory. Would this be similar to dynamically allocated shared memory – i.e., using a single base array and offsetting into this array?
When I say “dynamically allocate” constant memory, I mean in the sense analogous to what is described in the Programming Guide 0.8 (pp. 19-20) for shared memory:
I have several arrays whose number and length may vary from kernel invocation to kernel invocation. (Specifically, each array contains a set of discount factors or dividends for multiple times in the future.) These naturally fit into constant memory because:
a) they are constant for the life of the kernel invocation;
b) they are shared across blocks;
c) they will benefit from cacheing, since a particular thread block will use only one set of arrays;
d) they are too big in general to fit into the shared memory.
My question is, how do I declare these arrays? Since their size is variable at kernel invocation, I can’t declare them statically.
I was trying something like this, analogous to shared memory “allocation”:
But I’m still unclear on exactly how to (correctly!) initialize constant memory. (The Programming Guide doesn’t offer much help, either.)
Do you mean, e.g., running out of shared memory? The docs say the GeForce 8800 series have 64KB of constant memory, with 8KB cache per multiprocessor. My constant data can fit in 64KB, and the cache will hold what is necessary for each thread block. “For all threads of a warp, reading from the constant cache is as fast as reading from a register as long as all threads read the same address.” (Programming Guide 0.8, p. 58) So I think constant memory is the way to go in my case.
Looking at the .ptx files, it seems to me that constants are embedded into executable at compile time (that’s why you have to copy bytes to addresses specified by symbols to set constant values at run-time ). So, you wouldn’t be able to allocate constant arrays of different sizes for different invokations of the same compiled kernel.
I think you should be able to use texture memory for your purposes. Texture sizes can be set dynamically and they are cached. Just make sure to sample in the exact texel center to avoid filtering (unless filtering helps you).
Actually, if you know the maximum size of the array you’d like to put in constant memory, you could just statically allocate that and copy only the desired number of bytes at run-time. I haven’t tried constant arrays, but they ought to work fine.
Here’s how I’ve used constant qualifier for scalar variables:
__constant__ int constantN;
__global__ function(...)
{
int x = constantN*constantN;
....
}
int hostFunction()
{
int dim = 512;
cudaMemcpyToSymbol(constantN, &dim, sizeof(dim));
....
}
Perhaps there’s a better way. I can’t say that this is any better than just passing an argument to the global function (passing arguments may actually lead to more readable code), since the constant must be loaded into a register to be useful. However, it makes good sense if you have arrays of constants. Just make sure that all your constant data fits (Section 5.1 of the programming guide).
Here’s what I’ve done – works under the emulator so far:
__constant__ float constMemPool[16384]; // ... or some other big enough number
__global__ void kernel(unsigned int nElem0, unsigned int nElem1, unsigned int nElem2, float *g_Out)
{
__constant__ float *array0 = constMemPool + 0;
__constant__ float *array1 = array0 + nElem0;
__constant__ float *array2 = array1 + nElem1;
// do some things here that reference these arrays
// for example:
float x = array1[threadIdx.x];
...
}
void host_function()
{
...
// h_arrayIn contains the elements of all input arrays "linearized"
// so they may be copied into constMemPool
cudaMemcpyToSymbol(constMemPool, h_arrayIn, cbIn, 0);
kernel<<<grid, threads, cbSharedMem>>>(n0, n1, n2, d_vfOut);
...
}
All this seems to work… (famous last words)
… but, I’m guessing the constant memory allocated to constMemPool is blocked from all other uses until the application terminates.
That’s not a problem if this is the only application using the GPU, since I can reuse constMemPool for other kernel calls.
However, if multiple applications are competing for the GPU’s resources, could the GPU easily run out of constant memory (e.g., 64KB limit for G88 series)?
So, I guess this won’t work quite as written… the variables array1 and array2 have to be replaced with macros that access these using constMemPool directly:
I believe only host functions can set values in constant memory. Try removing constant qualifiers from array1 and array2 declarations. Alternatively, you could keep those pointers in constant memory and set their values from the host function that invokes the kernel.
I am also using constant memory as a dynamic memory pool in this manner. However, I am having a different problem.
The issue is that any file which wishes to access the constant memory must know the name of the constant array.
However, I can’t get extern to work like it should - namely, to predeclare a constant array in a header, while the actual constant array exists in only one file.
The results is the program compiles fine, but I get the runtime error:
“duplicate global variable looked up by string name”
It appears that every .cu file that includes the header is in essence creating its own copy of the constant array (or something like that.)
I try putting a declaration in the header like this (included by several different files):
extern constant int jCudaConstArrayInt[J_MAX_CONST_MEM_INT/sizeof(int)];
Then, in one file (only) I have the actual arrays as so:
constant int jCudaConstArrayInt[J_MAX_CONST_MEM_INT/sizeof(int)];
But as soon as more than one file includes the header, I get the duplicate symbol error.
Any ideas how I can make this work such that more than one module can access the same constant array?
I am also using constant memory as a dynamic memory pool in this manner. However, I am having a different problem.
The issue is that any file which wishes to access the constant memory must know the name of the constant array.
However, I can’t get extern to work like it should - namely, to predeclare a constant array in a header, while the actual constant array exists in only one file.
The results is the program compiles fine, but I get the runtime error:
“duplicate global variable looked up by string name”
It appears that every .cu file that includes the header is in essence creating its own copy of the constant array (or something like that.)
I try putting a declaration in the header like this (included by several different files):
extern constant int jCudaConstArrayInt[J_MAX_CONST_MEM_INT/sizeof(int)];
Then, in one file (only) I have the actual arrays as so:
constant int jCudaConstArrayInt[J_MAX_CONST_MEM_INT/sizeof(int)];
But as soon as more than one file includes the header, I get the duplicate symbol error.
Any ideas how I can make this work such that more than one module can access the same constant array?