Use of constant caches for large data?

What is the maximum size of the constant memory pool one can create? For several applications I have in mind there are large blocks of coordinate data where every element is used by every thread. Constant memory seems like the ideal place to store these values, since every thread in a warp will access the same constants at the same time, providing the register-speed access (once cached) described in the CUDA programming guide. Since none of the examples demonstrate the use of constant memory, I'm curious whether constant memory can be of arbitrary size, or whether it is limited to statically allocated items of 64KB or less (the 64KB figure is mentioned on page 49 of the CUDA programming guide). The caching behavior of constant memory suits the access patterns my application will use, so if it's possible to create large (10MB or more) constant arrays, that would be highly desirable. If not, then perhaps texture memory is the next best option. Any comments you have are appreciated.


I’ve tried using cudaMemcpyToSymbol() to a constant array of fixed size, but I seem to get garbage out of the constant array regardless. I defined:

__constant__ float4 dstuff[SIZE];

then later in the code I have:

  float *hstuff = (float *) malloc(SIZE * 4 * sizeof(float));

  // ... init the hstuff array with data which will go into the constant array

cudaMemcpyToSymbol("dstuff", hstuff, 4 * SIZE * sizeof(float), cudaHostToDevice);

(no errors occur on this call)

Then inside the kernel, I reference the constant data in a loop that looks roughly like:


  for (index = 0; index < SIZE; index++) {
    float dx = coor.x - dstuff[index].x;
    float dy = coor.y - dstuff[index].y;
    float dz = coor.z - dstuff[index].z;
    float r_1 = 1.0f / sqrtf(dx*dx + dy*dy + dz*dz);
    sum += dstuff[index].w * r_1;
  }


I'm still getting gibberish out of the constant buffer, however.


First, always check for errors after you call a CUDA function (like cudaMemcpy) using cudaGetLastError / cudaGetErrorString. (Edit: always check for errors in your DEBUG build – error checking is not free so you might want to check less frequently in a release build, or only on mallocs.)
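For reference, a minimal error-checking pattern along those lines (the macro name is my own invention, not from any CUDA header) might look like:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: check the result of a CUDA runtime call and
// abort with a readable message on failure.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",                  \
                    __FILE__, __LINE__, cudaGetErrorString(err));         \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMemcpyToSymbol(dstuff, hstuff, bytes));
//   my_kernel<<<grid, block>>>(...);
//   CUDA_CHECK(cudaGetLastError());   // catches launch errors
```

In a release build you could compile the macro down to a bare call, which is one way to follow the "debug-only" advice above.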

Second, what size are you trying to allocate? To answer your first question, I believe the max allocatable size is 64KB.


I did check for errors, but none occurred. I tried various sizes, but found that allocating much more than 16KB would cause the kernel not to execute (a vague invalid-parameter error occurs if the size is too big…).


Actually the size that caused the kernel not to execute was more like 64KB, I’ll re-check it later today to make sure.

OK, someone here pointed out your problem:

cudaMemcpyToSymbol("dstuff", hstuff, 4 * SIZE * sizeof(float), cudaHostToDevice);

Passing “cudaHostToDevice” as the last parameter is incorrect. cudaMemcpyToSymbol ALWAYS copies from host to device (by definition). The fourth parameter is an offset into the array named by the symbol, i.e. where within it you want the data copied. You probably wanted 0 here.

Also, you should use dstuff, not “dstuff”.

So the correct code would be:

cudaMemcpyToSymbol(dstuff, hstuff, 4 * SIZE * sizeof(float), 0);
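Putting the corrections together, a minimal sketch (with SIZE chosen so the array exactly fills the 64KB constant region) would look something like this; error checking is omitted for brevity:

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

#define SIZE 4096                  // 4096 float4s * 16 bytes = 64KB, the constant limit

__constant__ float4 dstuff[SIZE];  // declared as an array, not a scalar

void upload(const float *hstuff)
{
    // Offset 0 into the symbol; host-to-device is the (default)
    // direction for cudaMemcpyToSymbol.
    cudaMemcpyToSymbol(dstuff, hstuff, SIZE * sizeof(float4), 0,
                       cudaMemcpyHostToDevice);
}
```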



Thanks for catching those; the constant buffer now appears to work for me. I believe David Kirk mentioned (in a lecture earlier this week) that DirectX 10 requires an unlimited-size constant buffer, and that this required the G80 to be able to cache huge constant pools. Is there any particular reason that CUDA only lets us access a 64KB constant area? I have a number of applications where a very large constant pool (many megabytes) would be very beneficial. For the time being I'll try texture fetches and see how well that works, but I'm curious whether there's any chance that CUDA will eventually allow a larger constant area, or whether I should forget about this strategy and focus on texture fetches for large constant data instead.
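For anyone trying the texture route described above, a sketch using the texture reference API of that CUDA era (since removed in modern CUDA; all names here are my own choosing) might look roughly like:

```cuda
#include <cuda_runtime.h>

// 1D texture reference over linear device memory.
texture<float4, 1, cudaReadModeElementType> texStuff;

__global__ void accumulate_tex(float *out, float4 coor, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        float4 s = tex1Dfetch(texStuff, i);   // fetch through the texture cache
        float dx = coor.x - s.x;
        float dy = coor.y - s.y;
        float dz = coor.z - s.z;
        sum += s.w / sqrtf(dx*dx + dy*dy + dz*dz);
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = sum;
}

// Host side (error checking omitted): put the data in a linear buffer
// and bind it to the texture reference before launching.
//   float4 *dbuf;
//   cudaMalloc((void **)&dbuf, n * sizeof(float4));
//   cudaMemcpy(dbuf, hstuff, n * sizeof(float4), cudaMemcpyHostToDevice);
//   cudaBindTexture(0, texStuff, dbuf, n * sizeof(float4));
```

Unlike constant memory, the linear buffer behind the texture can be many megabytes, at the cost of going through the texture cache rather than the constant cache.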


Interestingly, I implemented both an iterative solution to my algorithm using the constant buffers and a single-pass implementation using texture fetching, and the constant buffer method outruns the texture fetch method by a factor of two, even though I have to update the constant buffer and re-execute the kernel many times to process the contributions from the full dataset. For the iterative implementation, I can execute kernels at a rate of around 200 per second before performance begins to decrease due to execution overhead (using a smaller slice of the 64KB constant buffer and doing a larger number of overall iterations).

I'm wondering if there are ways I can get the texture-based method to run faster. All threads in all of the blocks basically loop over all of the elements in the constant data, performing some calculations and accumulating results in a local variable. As such, they all read the same constant data at the same time, at least in the current implementations of both variants of the algorithm. For the constant-buffer method this is ideal, since my reading of the programming guide leads me to believe I'm getting register-speed access to the constant data when all threads read the same item. For the texture-based method, I'm wondering if I'd get better performance if each thread read from a different texel, or are texture fetches just going to be inherently slower than the constant pool for an algorithm that reads every element in sequence?
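The iterative constant-buffer scheme described above can be sketched roughly like this (the kernel name and helper are hypothetical, and error checking is omitted):

```cuda
#include <cuda_runtime.h>

#define CHUNK 4096   // float4s per pass; 4096 * 16 bytes = 64KB

__constant__ float4 dstuff[CHUNK];

// Hypothetical kernel that adds this slice's contribution to dsums.
__global__ void accumulate_kernel(float *dsums, int count);

// Process the full dataset one 64KB slice at a time: refill the
// constant buffer, launch the kernel, accumulate across passes.
void run_passes(const float4 *hstuff, int total, float *dsums,
                dim3 grid, dim3 block)
{
    for (int base = 0; base < total; base += CHUNK) {
        int count = (total - base < CHUNK) ? (total - base) : CHUNK;
        cudaMemcpyToSymbol(dstuff, hstuff + base,
                           count * sizeof(float4), 0,
                           cudaMemcpyHostToDevice);
        accumulate_kernel<<<grid, block>>>(dsums, count);
    }
    cudaThreadSynchronize();   // era-appropriate sync (cudaDeviceSynchronize today)
}
```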

Any comments or suggestions are appreciated.


I think you could also consider using shared memory to stage the constant values. This probably requires a more sophisticated kernel, but you would not have to refill the constant array with new values between passes.
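A shared-memory variant of that idea (the usual tiling pattern; names here are my own) could look roughly like:

```cuda
#include <cuda_runtime.h>

#define TILE 256   // elements staged per tile; one per thread in the block

__global__ void accumulate_shared(const float4 *gstuff, float *out,
                                  float4 coor, int n)
{
    __shared__ float4 tile[TILE];
    float sum = 0.0f;

    for (int base = 0; base < n; base += TILE) {
        // Each thread cooperatively loads one element of the tile.
        int i = base + threadIdx.x;
        if (threadIdx.x < TILE && i < n)
            tile[threadIdx.x] = gstuff[i];
        __syncthreads();

        int count = min(TILE, n - base);
        for (int j = 0; j < count; j++) {
            float dx = coor.x - tile[j].x;
            float dy = coor.y - tile[j].y;
            float dz = coor.z - tile[j].z;
            sum += tile[j].w / sqrtf(dx*dx + dy*dy + dz*dz);
        }
        __syncthreads();   // don't overwrite the tile while others still read it
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = sum;
}
```

Note that because every thread reads the same tile element at the same time, the shared-memory read is a broadcast, so no per-thread copies are needed to avoid bank conflicts.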

I considered the shared memory approach, but I've since done timings on the constant buffer code, comparing the performance per work unit for a dataset that fits entirely in the constant buffer against one that requires 66 iterations, and there's only a ~1% difference in performance, so I don't think I'd gain anything from using shared memory instead of the constant buffer.

For this specific access pattern, all of my threads access the same constant data at the same time, so I should (according to the programming guide) achieve register-speed constant data access rates. The only reason to switch to shared memory would be if it allowed a larger per-iteration working set than fits in the constant area, but that doesn't seem to be the case, given the size of the shared memory area and the fact that I'd need multiple copies of things for each thread to prevent bank conflicts, if I understand all of that correctly. Since reloading the constant pool only costs about 1% of my execution rate, I'll stick with the constant buffer method for now. I'm still curious whether there's a way to make the texture fetch method work as well as the constant pool does…