What happens on a constant cache miss?

I have a kernel that uses 400 B of constant memory. In the kernel, I access the constant memory serially, one 32-bit word at a time. When there is a cache miss, does the GPU fetch a contiguous block of constant memory, or does it fetch only the 32-bit word that was requested?

Also, is it possible to control how much data the GPU fetches when there’s a cache miss? For example, on the first cache miss, the idea would be to have the GPU fetch all 400 B and place it in the constant cache.



The programming guide contains all the information you’re likely to get on the matter, which isn’t much.

If I had to guess, though, I’d say the minimum amount it brings into the cache is sixteen 32-bit values (64 bytes). I only guess this because that is what warps need for coalesced reads, and I would expect the hardware to coalesce the read into the cache as well.
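Since this is only a guess, one way to check it on your own card is to time each serial read and look for where the latency jumps. Below is a rough sketch of such a probe, not a definitive benchmark. The names `c_chain`, `probe_kernel`, and `probe_constant_latency` are made up for illustration. It assumes a device that supports `clock64()`, and the compiler is free to schedule instructions around the clock reads, so treat the cycle counts as indicative only.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N_WORDS 100  // 100 x 4 B = 400 B, the size from the question

// Hypothetical constant array: entry i holds the index of the next word to read.
__constant__ unsigned int c_chain[N_WORDS];

// A single thread walks the chain so that each read depends on the previous one,
// and records how many clock cycles each read appeared to take.
__global__ void probe_kernel(long long *cycles, unsigned int *last)
{
    __shared__ unsigned int s_idx[N_WORDS];
    unsigned int idx = 0;
    for (int i = 0; i < N_WORDS; ++i) {
        long long t0 = clock64();
        idx = c_chain[idx];   // dependent 32-bit read from constant memory
        s_idx[i] = idx;       // use the value so the read has to complete
        long long t1 = clock64();
        cycles[i] = t1 - t0;
    }
    *last = s_idx[N_WORDS - 1];
}

// Fills host_cycles[0..N_WORDS-1] with the per-read cycle counts.
// Returns the final chain index (0 if the whole chain was walked), or -1 on a CUDA error.
int probe_constant_latency(long long *host_cycles)
{
    unsigned int h_chain[N_WORDS];
    for (int i = 0; i < N_WORDS; ++i)
        h_chain[i] = (i + 1) % N_WORDS;  // 0 -> 1 -> ... -> 99 -> 0

    if (cudaMemcpyToSymbol(c_chain, h_chain, sizeof(h_chain)) != cudaSuccess)
        return -1;

    long long *d_cycles = nullptr;
    unsigned int *d_last = nullptr;
    if (cudaMalloc(&d_cycles, N_WORDS * sizeof(long long)) != cudaSuccess)
        return -1;
    if (cudaMalloc(&d_last, sizeof(unsigned int)) != cudaSuccess) {
        cudaFree(d_cycles);
        return -1;
    }

    probe_kernel<<<1, 1>>>(d_cycles, d_last);

    unsigned int h_last = 0xFFFFFFFFu;
    cudaError_t err = cudaMemcpy(host_cycles, d_cycles,
                                 N_WORDS * sizeof(long long),
                                 cudaMemcpyDeviceToHost);
    if (err == cudaSuccess)
        err = cudaMemcpy(&h_last, d_last, sizeof(h_last), cudaMemcpyDeviceToHost);

    cudaFree(d_cycles);
    cudaFree(d_last);
    return (err == cudaSuccess) ? (int)h_last : -1;
}
```

Call `probe_constant_latency` from your own `main` and print the array. If the slow reads show up at a regular spacing (say every 16th word), that spacing is a hint at how many bytes are brought in per miss. If only the very first read is slow, the whole 400 B probably arrived in one go.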

That said, I doubt that prefetching vs. not prefetching your 400 B will make a noticeable difference in the total run time of your app. What matters more when accessing constant memory is that all threads in a warp access the same 32-bit word at the same time.
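To make that last point concrete, here is a small sketch of the two access patterns. The names `c_words`, `uniform_sum`, `divergent_sum`, and `run_sum` are hypothetical, and the 64-thread launch is an arbitrary choice. Both kernels compute the same sum over a 400 B constant array; the only difference is whether the threads of a warp ask for the same word on each iteration.

```cuda
#include <cuda_runtime.h>

#define N_WORDS 100  // 100 x 4 B = 400 B of constant memory

// Hypothetical constant array holding the values 1..100.
__constant__ unsigned int c_words[N_WORDS];

// Uniform pattern: on every iteration all threads read the same word,
// so a single constant-cache read can serve the whole warp.
__global__ void uniform_sum(unsigned int *out)
{
    unsigned int acc = 0;
    for (int i = 0; i < N_WORDS; ++i)
        acc += c_words[i];
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

// Divergent pattern: the index depends on threadIdx.x, so threads in a warp
// read different words on the same iteration and the reads are serialized.
__global__ void divergent_sum(unsigned int *out)
{
    unsigned int acc = 0;
    for (int i = 0; i < N_WORDS; ++i)
        acc += c_words[(i + threadIdx.x) % N_WORDS];
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

// Runs one of the two kernels and checks that every thread produced 1 + 2 + ... + 100.
// Returns that sum (5050) on success, or -1 on a CUDA error or a wrong result.
int run_sum(bool uniform)
{
    const int threads = 64;
    unsigned int h_words[N_WORDS];
    for (int i = 0; i < N_WORDS; ++i)
        h_words[i] = i + 1;

    if (cudaMemcpyToSymbol(c_words, h_words, sizeof(h_words)) != cudaSuccess)
        return -1;

    unsigned int *d_out = nullptr;
    if (cudaMalloc(&d_out, threads * sizeof(unsigned int)) != cudaSuccess)
        return -1;

    if (uniform)
        uniform_sum<<<1, threads>>>(d_out);
    else
        divergent_sum<<<1, threads>>>(d_out);

    unsigned int h_out[threads];
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess)
        return -1;

    for (int t = 0; t < threads; ++t)
        if (h_out[t] != 5050u)
            return -1;
    return 5050;
}
```

Both variants return the same result, so any difference you see when timing them (with CUDA events or a profiler) comes from the access pattern alone. I would expect the uniform version to be the faster one, but measure it on your own hardware before relying on that.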