I’ve started reading the CUDA technical documentation and have noticed that it doesn’t say much about how constant memory works on Nvidia GPUs. For example:
I’ve read that I should expect 64kB of constant memory, and that it resides in global memory but is cached on chip. Does this mean one can declare an arbitrary amount of RAM as constant, with 64kB at a time cached on chip, or that the entire GPU environment only has room for 64kB of constants?
Likewise, each multiprocessor has 8kB of constant cache on CC 1.0 devices, so how is this different from the 64kB on chip? If the cache holds a subset of the total constant memory, what line size and replacement policy should I expect?
Thanks for any clarifications…
PS: I hope to get started adapting some computational number theory code to GPUs, specifically the various phases of the number field sieve (starting with the polynomial selection)
It does not mean that you can have an arbitrary amount of RAM, 64K of which would be cached.
Constant memory is 64K total. When you access part of it, that part is cached, so subsequent accesses cost 0 or 1 clocks. It can only be written from the host (the “constant” in constant memory refers to the fact that it cannot be written from the device). I have used this to quite some benefit, speeding up existing code by a factor of 2-3; together with loop unrolling, that took some code from 70x faster than the CPU to about 350x faster. The first access to constant memory is slow (it comes from global memory); after that it is fast.
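To make the host-write / device-read pattern concrete, here is a minimal sketch. The array and function names (`lut`, `scale_kernel`, `upload_lut`) are made up for illustration; only `__constant__` and `cudaMemcpyToSymbol` are real CUDA constructs:

```cuda
// Hypothetical lookup table living in the 64KB constant space.
__constant__ float lut[256];

__global__ void scale_kernel(float *out, const unsigned char *idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = lut[idx[i]];   // device side: read-only, cached per MP
}

// Host side: constant memory can only be written from here.
void upload_lut(const float *host_lut)
{
    cudaMemcpyToSymbol(lut, host_lut, 256 * sizeof(float));
}
```

Note there is no `cudaMalloc` for `lut`: the symbol is declared statically and the host fills it by name before launching the kernel.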
What do you mean by the 8kB – registers or shared memory? There is no 8kB “constant cache” that I know of; there are 8K registers (8192) on 1.0/1.1 devices (e.g. the 9650 in my laptop) and 16K on 1.3 devices (e.g. the GTX260 in my desktop). That is NOT “constant” memory.
IIRC there is a constant memory cache (as you’ve said “When you access some of it, it will be cached”). I don’t remember whether it was 8KB or 4KB per MP. There’s also a texture memory cache (16KB per TPC in CC<1.2 and 24KB per TPC in CC>=1.2 devices, thus 8KB per MP) and all that is in addition to the register file and shared memory (which are both bigger than 8KB).
There might be some confusion because the entire cmem is as small as a CPU’s cache would be :) But there’s an even smaller cache for it nevertheless.
So, constant memory is 64KB of device-read-only memory that gets cached in the small (8KB) cache each MP has. It’s perfect for data that is read by all threads in a block simultaneously, in broadcast mode. If a single address in constant memory is read by all threads within a warp and it happens to be cached, it’s as fast as registers. If it’s not cached, there’s a single (I presume) fetch from the 64KB store in device memory, with about the same latency as a fetch from global memory.
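Since the original poster mentioned polynomial selection, here is a hedged sketch of the broadcast pattern described above: evaluating a fixed polynomial whose coefficients sit in constant memory. Every thread in a warp reads `coeff[k]` for the same `k` at the same time, so each read is a broadcast (the fast, register-speed case). The names, the degree bound, and the kernel itself are made up for illustration:

```cuda
#define MAX_DEG 16
__constant__ double coeff[MAX_DEG + 1];  // filled from the host via cudaMemcpyToSymbol
__constant__ int    degree;

__global__ void eval_poly(const double *x, double *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    double acc = coeff[degree];            // same address for the whole warp
    for (int k = degree - 1; k >= 0; --k)  // Horner's rule
        acc = acc * x[i] + coeff[k];       // again uniform across the warp
    y[i] = acc;
}
```

By contrast, if each thread indexed `coeff` with a thread-dependent subscript, the warp’s reads would no longer be a single broadcast and you would lose the registers-speed behaviour described above.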