Constant / global memory

Hi,

In Optimization), local memory access latency is ~100x lower than that of global memory.

But it doesn't give the latency for accessing constant memory. Is there any documentation I can study?


You can have up to 64 KB of device memory set aside for constant access. The first time this memory is accessed, the latency is the same as for any other type of device memory (200+ clocks). But once the data is in the cache close to the compute units, the latency can be no more than that of the register bank (basically zero). Cache lines are 256 bytes, I think, so the first request to a 256-byte block can be expensive, but subsequent accesses inside that same block are free. I forget how big the close-to-compute cache is. There may be an additional 64-byte cache even closer to the CUDA cores that gives you register access speeds, but it probably isn't worth designing around.
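
To make the "first touch is expensive, the rest of the block is free" point concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch: the 256-byte line size, the 200-clock miss cost and the zero-clock hit cost are the approximate figures from this reply, not measured values, and the function name `total_clocks` is made up for illustration. It also assumes the cache never evicts anything, which real hardware will not guarantee.

```python
# Toy model only: the constants are the rough figures quoted in this thread,
# not measurements, and the cache is assumed to be unbounded (no eviction).
LINE_BYTES = 256      # assumed cache line size ("256 bytes, I think")
MISS_CLOCKS = 200     # assumed cost of the first touch of a line
HIT_CLOCKS = 0        # assumed cost once the line is already cached


def total_clocks(byte_offsets):
    """Sum the modeled latency of a sequence of constant-memory reads."""
    cached_lines = set()
    clocks = 0
    for offset in byte_offsets:
        line = offset // LINE_BYTES
        if line in cached_lines:
            clocks += HIT_CLOCKS
        else:
            cached_lines.add(line)
            clocks += MISS_CLOCKS
    return clocks


# 64 consecutive 4-byte reads fit in one 256-byte line: one miss, 63 hits.
dense = total_clocks(range(0, 256, 4))

# 64 reads spaced 256 bytes apart each land on a new line: 64 misses.
sparse = total_clocks(range(0, 64 * 256, 256))

print(dense, sparse)
```

Under these assumptions the densely packed reads cost one miss in total, while the spread-out reads pay the miss cost every time, so it helps to keep constants that are used together packed close together.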