Constant memory per multiprocessor

The programming guide states that there is 64KB of constant memory with a cache working set of 8KB per multiprocessor. Can one set the constant memory for a specific multiprocessor, and if so, how?


It’s a cache, so in principle one should just read the memory like normal and the cache will take care of everything.

So you believe it is (64*16)KB (for a GTX, with its 16 multiprocessors) of constant memory with 16 times 8KB of cache (and each 8KB is shared between the blocks running on that multiprocessor). That would be great.


No, there is only 64 KB of globally available constant memory. Just try to write a program that allocates more in a static array; you get a compile error, IIRC. The 8KB per multiprocessor is just a cache working set: as you access values from constant memory, that 8KB fills up with previously accessed values (and possibly nearby values in memory), so that later accesses to the same memory location do not pay the full latency cost again. Once a multiprocessor has touched more than 8KB of distinct locations, old entries in the cache are evicted.
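A quick sketch of both points — the static 64KB limit and the cached reads (array sizes and names here are mine, purely for illustration):

```cuda
// Statically declared constant memory; the total of all __constant__
// declarations in the program must stay within 64KB.
__constant__ float lut[16000];       // 64000 bytes, close to the limit
// __constant__ float more[2000];    // adding this would push past 64KB,
//                                   // and the compiler refuses it

__global__ void lookup(float *out, const int *idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        // Each read goes through the per-multiprocessor 8KB cache; a
        // second read of the same (or a nearby) entry is a cache hit.
        out[i] = lut[idx[i]];
}
```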

I wonder how this works if you have multiple .cu compilation units in your program. Is the total amount of constant memory in your entire program fixed at 64KB, or just that of one compilation unit? The second sounds more plausible.

MisterAnderson64 is right. The “per multiprocessor” in the programming guide applies to the cache and not to the 64KB of constant memory (so it’s a total of 64KB of constant memory with an 8KB cache per multiprocessor). Multiple .cu files, each containing their own constants, are still restricted to an aggregate of 64KB.


This might be like the texture reference problem. The limit ought to be per-kernel, but ends up being per-program. Manual compilation and the driver API might fix it.

The 64KB limit should be per kernel (per execution).

But that would force the driver to copy 64KB from the device to host and back every time you call a kernel. People are already complaining about driver overhead. As it is, one can put a bunch of data in const memory and use it from every kernel in the application without ever needing to copy it again.
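The upload-once, share-everywhere pattern being described looks roughly like this (kernel and symbol names are hypothetical):

```cuda
__constant__ float coeffs[256];   // one copy, visible to every kernel below

__global__ void smooth(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= coeffs[0];
}

__global__ void offset(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += coeffs[1];
}

void run(float *d_x, int n, const float *h_coeffs)
{
    // Upload once...
    cudaMemcpyToSymbol(coeffs, h_coeffs, 256 * sizeof(float));
    // ...then any number of kernels read it with no further copies.
    smooth<<<(n + 255) / 256, 256>>>(d_x, n);
    offset<<<(n + 255) / 256, 256>>>(d_x, n);
}
```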

Why would that mean more overhead? I understand that every cudaMemcpyToSymbol would create overhead. But if you don’t want to change the constant memory, just leave it alone.

It should be per kernel (or per .cu file), because right now you have no idea whether some other part of the program also uses CUDA and might consume some constant memory. You never know how much you can use without messing up the rest of your program.
This kind of messes up any modular system, like a multimedia pipeline.

Right, exactly my point. There isn’t any overhead right now. Many kernels can all share the same constant memory and life is good from my perspective, since this is how I am using constant memory.

But others here are proposing a system where each kernel has its own constant memory space. Assuming that there is only one constant memory area on the device, the driver would then be required to do a bunch of twiddling around with the constant memory on every kernel call to make sure it is up to date.

It doesn’t matter because NVIDIA has given us what they have given us.

This means you are only writing very small, specific applications that use CUDA. You might imagine that if you have a big system, made by multiple developers, you will quickly lose track of the total amount of constant memory used. Even more so if parts are loaded/unloaded as shared libraries.

It would be different if you could allocate and deallocate constant memory on the fly; then you could release it when you were done. As it is, a kernel that sits around doing nothing for the entire span of your application except 5 seconds, but still requires 10KB of constant memory, will hog that space the whole time. That’s just bad…

My application is 100% targeted at CUDA yes, but it is not a tiny application. Currently, I’m only using a small portion of the constant memory, but as I add additional features (= more kernels) it will probably be eaten up rather quickly. At that point I plan on evaluating the use of textures instead of constant mem in those kernels where the performance loss is not significant.
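Falling back to textures for read-only tables would look something like the following sketch, using the texture-reference API of that era (names are mine):

```cuda
// A 1D texture reference bound to ordinary linear device memory.
texture<float, 1, cudaReadModeElementType> tableTex;

__global__ void lookupTex(float *out, const int *idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(tableTex, idx[i]);  // cached read, no 64KB limit
}

void setup(float *d_table, size_t bytes)
{
    // Bind the texture to the device buffer; unlike __constant__ data,
    // the table can be far larger than 64KB.
    cudaBindTexture(0, tableTex, d_table, bytes);
}
```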

I’m not trying to say that the system is perfect as is. I’m just pointing out that blindly having the driver switch between different 64KB blocks of constant memory per kernel is a bad idea, as it will remove the useful functionality of sharing those constant values among many kernels operating together on the same problem. I agree that there is a need for some kind of allocation/de-allocation scheme.

DX10 put a lot of emphasis on constant memory. Most of this capability isn’t exposed. For example, a shader can actually access 16 buffers at once, each of which can be up to 64KB. A shader is allowed to write its results into another’s constant buffer. Finally, buffers fall into one of several storage types, which restrict who can write to them and in doing so allow the driver to optimize.

Constant buffers are so flexible that the fundamental distinction between constants and textures seems blurred. In fact, it’s more blurred than that. In DX10 there are two families of resources: textures and buffers. Buffers, meanwhile, come in two flavors: constant buffers and texture buffers. Uhuh. In any case, the difference between a true texture and any kind of buffer is that a texture can do tricks like mipmapping and filtering. Meanwhile, a buffer is often optimized as a means for the CPU to write data and for the GPU to read it (but not always, see above).

Where this leaves CUDA is uncertain. Worse, patterns emerge that don’t quite make sense. CUDA’s distinction between CUDA arrays and linear textures looks a lot like DX10’s distinction between true textures and texture buffers. Then are 1D linear textures the same thing, underneath, as CUDA constants? In Appendix A, the Guide conspicuously only specifies the “cache working set for one-dimensional textures,” which turns out to be the same size as for constants. Yet what about two-dimensional textures? Do they have a separate cache? Did they used to, but now it’s called shared memory?

A lot of questions, yet what seems certain is that NVIDIA is not too concerned with clarifying the situation. In fact, it seems they don’t like the whole confusing business of textures and constants to start with (but who can blame them). We’ll almost certainly not see the intricate hierarchy of resources and storage classes that DX sets up, but will we at least be able to make practical use of all those memory classes that aren’t global, local, or shared?

I do come away with the feeling that all this could be made much more sane if the concepts were refactored. Can’t we just declare a resource and say whether/where we want it cached, at what level it is shared (thread, block, kernel), and whether to turn on filtering? Every permutation doesn’t need its own name.

constant is comparable to what are called ‘shader parameters’ in the programmable graphics pipeline (also known as uniforms, or even shader constants). At least for DirectX9 and OpenGL. I have never looked at DX10, so I don’t know what they call it.

My point was that things changed 180 degrees from DirectX9, so why the comparison?

My guess is NV exposed the true thing in CUDA and just emulated DX10 constant buffer.

I just tested and found that writing to a constant buffer is actually supported in CUDA. Just get the symbol’s address on the host and pass it to a kernel as a global pointer. The write indeed succeeded, and the next pass read the written value. I used the driver API.
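In driver-API terms, what is described above is roughly this (module, kernel, and symbol names are hypothetical):

```cuda
// Look up the device address of a __constant__ symbol in a loaded module.
CUdeviceptr constAddr;
size_t constBytes;
cuModuleGetGlobal(&constAddr, &constBytes, module, "coeffs");

// Pass that address to a kernel as an ordinary global pointer; the kernel
// writes through it like any other global memory.
void *args[] = { &constAddr };
cuLaunchKernel(writerKernel, 1, 1, 1, 1, 1, 1, 0, 0, args, 0);

// A subsequent kernel that reads coeffs[] through the constant path
// then observes the value written above.
```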

The constant cache also doesn’t feel like the texture cache, since divergent accesses to contiguous addresses degrade performance.

Because the comparison is obvious. It makes just as little sense to compare CUDA to DX9 as to DX10. It’s even possible that some CUDA features are not exposed in any graphics APIs at all.