Can I use __constant__ memory with a pointer to it as a kernel arg?

Suppose I want to use constant memory, so I declare a global struct of my own type, called devXc:

__constant__ struct Xdef devXc; /* put constant stuff in here */

Then I want to fill it from an instance of struct Xdef (pointed to by X) in host memory:

struct Xdef *X = setX( …);
HANDLE_ERROR(cudaMemcpyToSymbol(&devXc,X,sizeof(struct Xdef),0,cudaMemcpyHostToDevice));

That syntax is fine with the compiler.

I want to pass a pointer to devXc into my kernel. I really, really don't like globals, and
everything else is passed in as a pointer:

     gpuKernel<<<nBlocks,BlockSz>>>(nScn, devP,dev_S, &devXc,devO);

and hope to wake up in my kernel function

__global__ void gpuKernel(int nScn, struct Pdef *P, struct Sdef *S, struct Xdef *X, struct Odef *O)
{
    int bid = blockIdx.x;
    int tid = threadIdx.x;
    int blocksz = blockDim.x;
    commonKernel(bid, tid, blocksz, nScn, P, S, X, O);
}

Well, the compiler likes it, but I get a launch error immediately.

  1. To use constant memory, do I absolutely have to use a global reference inside the kernel,
    and not pass in a pointer to devXc to the gpuKernel function?
    I really hate globals.

  2. Is there a way I can declare the constant memory as a pure __constant__ struct Xdef *,
    then in the host do a cudaMalloc into it, and then load it with cudaMemcpyToSymbol?

  3. Is constant memory faster than plain old global memory anyway, or am I going through this pain
    for no performance gain by making X into constant memory?


I don’t know any other way to use the constant memory than just accessing it within the kernels by name without passing anything as parameters.

As for the speedup: when I moved some of the parameters I had been passing to the kernel into constant memory, I got a very large speed increase. This was for an array of values where each thread needed to read each value. I believe the advantages of constant memory are that reads are broadcast to multiple threads at once and that it is cached…
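A minimal sketch of the access pattern described above, where every thread reads every value of a small constant table and all threads in a warp read the same element at the same time. The names `coeffs`, `NCOEFF`, `evalPoly` and `runPolyDemo` are made up for illustration and do not come from the original poster's code.

```cuda
#include <cuda_runtime.h>

#define NCOEFF 4

// Hypothetical coefficient table living in constant memory.
__constant__ float coeffs[NCOEFF];

// Each thread evaluates the same polynomial at its own x.
// For a given k, all threads read coeffs[k] together, which is
// the uniform access pattern the constant cache is built for.
__global__ void evalPoly(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int k = NCOEFF - 1; k >= 0; --k)
        acc = acc * x[i] + coeffs[k];   // Horner's rule
    y[i] = acc;
}

// Host helper (illustrative): evaluates p(x) = 1 + 2x + 3x^2 + 4x^3
// at x = 2 on the device and returns the first result.
float runPolyDemo()
{
    const float hostCoeffs[NCOEFF] = {1.0f, 2.0f, 3.0f, 4.0f};
    const int n = 32;
    float hx[n], hy[n];
    for (int i = 0; i < n; ++i) hx[i] = 2.0f;

    float *dx = NULL, *dy = NULL;
    cudaMalloc((void **)&dx, n * sizeof(float));
    cudaMalloc((void **)&dy, n * sizeof(float));
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);

    // Fill the constant symbol by name; no pointer is passed to the kernel.
    cudaMemcpyToSymbol(coeffs, hostCoeffs, sizeof(hostCoeffs));

    evalPoly<<<1, n>>>(dx, dy, n);
    cudaMemcpy(hy, dy, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dx);
    cudaFree(dy);
    return hy[0];
}
```

The kernel refers to `coeffs` by name, which is the usage the reply above describes; error checking is left out to keep the sketch short.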

Out of interest why do you want to pass pointers to your kernels anyway if you don’t need to, am I missing something? Would this enable you to use one kernel with different sets of constant memory selected by passing pointers?

From the standpoint of program architecture, globals are sort of an abomination: encapsulation is broken, etc. So I just got over it, used constant memory, and did what I had to do, and you are right. It's faster.
I wish I had a lot more of it. On my C2050 I only have 64K. I am grateful for what I have of it though.

I’m not sure if it’s fully correct but you can pass a pointer to constant memory.

For example:

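__constant__ int constDev;

__global__ void doSomething(int *ptrToConst, int *ptrToGlo)
{
    // Every thread copies the single constant value into its own slot.
    ptrToGlo[threadIdx.x] = *ptrToConst;
}

int main()
{
    int N = 16;
    int *hostData = NULL,
        *devData = NULL;

    // Pinned host allocation; cudaHostAlloc needs a flags argument.
    cudaHostAlloc((void**)&hostData, sizeof(int), cudaHostAllocDefault);
    *hostData = 5;
    cudaMalloc((void**)&devData, N*sizeof(int));

    int *hostPtrToConst = NULL;
    // Pass the symbol itself; string names such as "constDev" were removed in CUDA 5.0.
    cudaMemcpyToSymbol(constDev, hostData, sizeof(int));
    cudaGetSymbolAddress((void**)&hostPtrToConst, constDev);

    doSomething<<<1, N>>>(hostPtrToConst, devData);
    cudaDeviceSynchronize();
    return 0;
}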

This should work fine. The problem is that the __constant__ variable must be declared in the same .cu file in which you call cudaGetSymbolAddress.

I’m not sure if that has been solved with CUDA 4.0. The manual says constant memory can be defined in a different file, but I keep getting an ‘invalid symbol’ error, or ‘symbol duplication’ if I add the ‘extern’ label.

To clarify, I have a GeForce 420M and I compile with arch_20.

Has anyone managed to get a symbol’s address from another .cu file?



There is a discussion of constant memory:

Essentially it can be much faster than global memory, but it's not really 64 Kbytes; it's a 64K window onto a much smaller caching system. Also use the profiler to check for warp_serialize. This can hurt performance much more than I anticipated when I first started using constant memory.
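To make the serialization point concrete, here is a rough sketch contrasting two ways of reading the same constant table. In the first kernel every thread of a warp reads the same address in each iteration; in the second, threads read different addresses at the same time, which is the pattern that gets serialized. The names `table`, `sumUniform`, `sumScattered` and `runAccessDemo` are invented for this example, and the timing difference is something to confirm with the profiler on your own hardware.

```cuda
#include <cuda_runtime.h>

#define TABLE_SIZE 256

// Hypothetical lookup table in constant memory.
__constant__ int table[TABLE_SIZE];

// Uniform access: in each iteration all threads read table[k].
__global__ void sumUniform(int *out)
{
    int acc = 0;
    for (int k = 0; k < TABLE_SIZE; ++k)
        acc += table[k];
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

// Scattered access: each thread starts at a different offset, so
// threads in a warp read different addresses in the same iteration.
__global__ void sumScattered(int *out)
{
    int acc = 0;
    for (int k = 0; k < TABLE_SIZE; ++k)
        acc += table[(k + threadIdx.x) % TABLE_SIZE];
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

// Host helper (illustrative): runs both kernels and returns 1 if
// every thread in both runs produced the expected sum, else 0.
int runAccessDemo()
{
    int hostTable[TABLE_SIZE];
    for (int i = 0; i < TABLE_SIZE; ++i) hostTable[i] = i;
    cudaMemcpyToSymbol(table, hostTable, sizeof(hostTable));

    const int n = 64;
    int hA[n], hB[n];
    int *dOut = NULL;
    cudaMalloc((void **)&dOut, n * sizeof(int));

    sumUniform<<<1, n>>>(dOut);
    cudaMemcpy(hA, dOut, n * sizeof(int), cudaMemcpyDeviceToHost);

    sumScattered<<<1, n>>>(dOut);
    cudaMemcpy(hB, dOut, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dOut);

    const int expected = TABLE_SIZE * (TABLE_SIZE - 1) / 2;
    for (int i = 0; i < n; ++i)
        if (hA[i] != expected || hB[i] != expected) return 0;
    return 1;
}
```

Both kernels compute the same sum; only the order in which each thread walks the table differs, so any gap in run time between them comes from the access pattern.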

I use the following. It's all in one .cu file and so compiled in one go by nvcc. The cudaMemcpyToSymbol call is actually in a regular C function which is called by regular C code on the PC compiled by gcc in the normal way.


__constant__ unsigned int Constant[15*1024]; //leave 1kwords free for other constants
cutilSafeCall(cudaMemcpyToSymbol((const char*)Constant,
                                 matrixw, matrix_size, 0, cudaMemcpyHostToDevice));
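One way the arrangement above could look is a small wrapper in the .cu file that is exported with C linkage, so that code compiled by gcc can call it. This is only a sketch: the function name `upload_matrix` and its return convention are hypothetical, and it assumes `matrixw` points to unsigned int data and `matrix_size` is a byte count.

```cuda
#include <cuda_runtime.h>

// Same declaration as in the post: 15K words of constant memory.
__constant__ unsigned int Constant[15 * 1024];

// Hypothetical wrapper, compiled by nvcc but callable from plain C.
// Returns 0 on success and -1 on failure.
extern "C" int upload_matrix(const unsigned int *matrixw, size_t matrix_size)
{
    // Refuse copies that would run past the end of the constant array.
    if (matrix_size > sizeof(Constant))
        return -1;

    cudaError_t err = cudaMemcpyToSymbol(Constant, matrixw, matrix_size,
                                         0, cudaMemcpyHostToDevice);
    return (err == cudaSuccess) ? 0 : -1;
}
```

On the C side the caller would only need a prototype such as `int upload_matrix(const unsigned int *matrixw, size_t matrix_size);`, which keeps all CUDA-specific syntax inside the .cu file.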