Constant memory usage in CUDA code

I can not figure it out myself, what is the best way to ensure the memory used in my kernel is constant. There is a similar question at c++ - How to use CUDA constant memory in a programmer pleasant way? - Stack Overflow.

I am working with GTX580 and compiling only for 2.0 capability. My kernel looks like

global Foo(const int *src, float *result) {…}

I execute the following code on host:

cudaMalloc(src, size);

cudaMemcpy(src, hostSrc, size, cudaMemcpyHostToDevice);

Foo<<<…>>>(src, result);

the alternative way is to add

constant src;

to .cu file, remove src pointer from the kernel and execute

cudaMemcpyToSymbol(“src”, hostSrc, size, 0, cudaMemcpyHostToDevice);

Foo<<<…>>>(result);

Are these two ways equivalent or the first one does not guarantee the usage of constant memory instead of global memory? size changes dynamically so the second way is not handy in my case.

As far as I know using ‘const’ as a kernel parameter modifier means the same as in host C/C++ - You cannot change such a variable, it is read-only. The ‘constant’ memory array/pointer/variable means that when accessing it one read will be done for half-warp of threads. See CUDA documentation for more precise information, please.

Well, on StackOverflow it is written that

That’s why I am so interested.

specifying the pointer as const is a protective measure. It declares that the data is input only and not input / output and should not be changed in the function. The compiler checks that it is not written to and is allowed to make optimizations based on that assumption. It does not alter the location of the memory.

Using the const keyword says that the data is accessed via the constant cache, which is something completely different. Although theoretically possible (as OpenCL works this way), I doubt that the compiler will automatically read the input via the constant cache as it’s highly prone to performance problems.

  1. The constant cache is limited in size (not just the cache, but the entire buffer size is limited to 64KB per kernel)
  2. The constant cache has different required access patterns than global memory. It only works well if all threads in the warp access exactly the same memory address (a scatter operation) otherwise there is access serialization.

constant cache is very good for storing things such as convolution kernels, but is not very good for general purpose access.