Managing Constant Memory

Hi,
I need some advice on constant memory management in the CUDA architecture.
Throughout the CUDA documentation, the programming guide, and the “CUDA by Example” book, all I can find regarding constant memory is how to assign/copy into a __constant__ declared array using the cudaMemcpyToSymbol() function. But there’s never any mention of how to modify or “free” these allocations. (Unlike texture memory, which can be unbound.)

Regarding modification:

I’m working on a problem where I have to update the values of my constant memory array after each kernel invocation. While searching for answers, I read that it wasn’t possible to modify constant memory once it had been assigned, but I recently found this post in these forums, which shows it’s actually possible to do what I need:
http://forums.nvidia.com/index.php?showtopic=28780&view=findpost&p=163129

My guess is that, by calling cudaMemcpyToSymbol() again, I can modify these values before each call to my kernel. Is this correct?
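Something like this minimal sketch is what I have in mind (the table size, kernel, and names are just placeholders):

```cuda
#include <cuda_runtime.h>

// Hypothetical lookup table in constant memory (up to 64 KB total).
__constant__ int d_table[256];

__global__ void useTable(int *out)
{
    // All threads of a warp reading the same d_table entry are served
    // by a single broadcast from the constant cache.
    out[threadIdx.x] = d_table[threadIdx.x % 256];
}

int main()
{
    int h_table[256];
    int *d_out;
    cudaMalloc(&d_out, 256 * sizeof(int));

    for (int iter = 0; iter < 10; ++iter) {
        // Recompute the table on the host, then overwrite the constant
        // symbol before the next launch.
        for (int i = 0; i < 256; ++i)
            h_table[i] = i + iter;
        cudaMemcpyToSymbol(d_table, h_table, sizeof(h_table));

        useTable<<<1, 256>>>(d_out);
        cudaDeviceSynchronize();
    }
    cudaFree(d_out);
    return 0;
}
```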

Regarding allocation:

What if I need a certain amount of constant memory, say 64 KB for a table of integers at one point, and later on I need another 64 KB table of floats and no longer need the first table of integers. Is there a way to “free” the first table in order to allocate the second one?

As far as I understand, constant memory allocation is done at compile time, which means I can’t allocate different amounts or sets throughout my program.
Is there a way around this?

I was thinking of using texture memory to allow dynamic allocation of my tables. Yet I was really looking for the broadcast benefit of constant memory, not the spatial-locality caching benefit of texture memory.

Thanks in advance,
Rodrigo

If you can target compute capability 2.x devices, you can rely on the L1 and L2 cache and just use global memory for your lookup tables. Otherwise, you are right that constant memory allocation size is “baked into” the kernel by the compiler and can’t be altered at runtime.
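A sketch of that approach (the kernel and names are hypothetical): the table lives in ordinary global memory allocated with cudaMalloc(), so it can be freed and reallocated at any size at runtime, and on compute capability 2.x repeated reads are served from L1/L2.

```cuda
#include <cuda_runtime.h>

__global__ void lookup(const int * __restrict__ table, const int *keys,
                       int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // On compute capability 2.x these global-memory reads go through the
    // L1/L2 caches; a hot table stays resident without using __constant__.
    if (i < n)
        out[i] = table[keys[i]];
}

// Host side: the table is a normal allocation, so "freeing" it is just
// cudaFree(), and a differently sized/typed table can replace it.
// int *d_table;  cudaMalloc(&d_table, 16384 * sizeof(int));
// ...launch lookup<<<blocks, threads>>>(d_table, d_keys, d_out, n);...
// cudaFree(d_table);
```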

Thanks for the information. I’ve been reading a bit about Fermi’s L1 and L2 caches, and as far as I can understand, they will only benefit concurrent or repeated reads from a single address after it’s been fetched (basically how any cache works). That won’t really give me the warp-broadcast feature of constant memory that I need for my table access. I understood that the L1 cache is basically shared memory being used as cache; in that case, won’t I get better results by pre-storing my tables in shared memory, which does feature a broadcast access mechanism?

Keep in mind that constant memory is just global memory being accessed through a small (6-8 kB) cache on the multiprocessor optimized for broadcast. The first fetch of a word still has to go out to global memory, and if your table is larger than the cache, words will be evicted and possibly require refetching. You are correct that if you can fit your entire lookup table into shared memory, that will be as good, if not better, than relying on the L1 cache.

To decide whether constant memory or cached global memory is better for table lookup on Fermi, I think we need a small microbenchmark. I don’t think I’ve seen anyone directly compare those two cases.
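The shared-memory variant mentioned above could be sketched like this (assuming the table fits in shared memory; sizes and names are placeholders):

```cuda
#include <cuda_runtime.h>

#define TABLE_SIZE 2048  // must fit within the SM's shared memory budget

__global__ void lookupShared(const int *g_table, const int *keys,
                             int *out, int n)
{
    __shared__ int s_table[TABLE_SIZE];

    // Cooperatively stage the table from global into shared memory once
    // per block.
    for (int i = threadIdx.x; i < TABLE_SIZE; i += blockDim.x)
        s_table[i] = g_table[i];
    __syncthreads();

    // When all threads of a warp read the same s_table entry, shared
    // memory broadcasts the word in a single transaction, with no bank
    // conflict.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = s_table[keys[idx] % TABLE_SIZE];
}
```

The trade-off is that staging costs one pass over the table per block, and the table permanently occupies shared memory that could otherwise hold working data.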

Constant memory is local to the source file in which it is declared. There is a 64k limit per file.

Put several large tables into different .cu modules, and voilà: you have a separate 64 KB page of cached constant memory available in each module. I am using that strategy without any problems.
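For instance (file names and table contents are hypothetical), each translation unit declares its own __constant__ symbol and copies into it with cudaMemcpyToSymbol() from within that file:

```cuda
// tableA.cu -- this module's constant bank: 16384 floats = 64 KB
__constant__ float c_tableA[16384];

__global__ void kernelA(float *out)
{
    out[threadIdx.x] = c_tableA[threadIdx.x];
}

// --------------------------------------------------------------
// tableB.cu -- a second module with its own independent bank:
// 16384 ints = 64 KB
__constant__ int c_tableB[16384];

__global__ void kernelB(int *out)
{
    out[threadIdx.x] = c_tableB[threadIdx.x];
}
```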

Christian
