I have a question about reading from constant memory versus reading from global memory in the context of a compute capability 2.0 device.
If all the threads in a half warp read the same 4-byte word in global memory, there will be one 128-byte read request: first to the L1 cache, then on a miss to the L2 cache, and on another miss a read from global memory. The loaded 4 bytes are then provided to each thread in the half warp.
If subsequent half warps request that same memory address, that data is likely cached, so memory loading will be quick.
Am I right to assume the same sort of thing happens for constant memory?
i.e. if all threads in a half warp read the same 4-byte word in constant memory, there is one read request to the constant cache; on a constant cache miss, the word is fetched from constant memory (which resides in device memory and is as slow as a global memory read).
If subsequent half warps request that same constant memory address, that data is likely cached, so memory loading will be quick.
What, then, is the advantage of using constant memory? I have read that constant memory reads are “broadcast” to the entire half warp, provided all threads in that half warp request the same constant memory address. But wouldn't the same thing happen for global memory accesses as well, as I describe above?
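For concreteness, here is a minimal sketch of the two uniform-address patterns I am asking about (the variable and kernel names are made up for illustration):

```cuda
// c_coeff lives in constant memory and is read through the constant cache;
// g_coeff lives in global memory and is read through L1/L2.
__constant__ float c_coeff;
__device__   float g_coeff;

__global__ void scale_const(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * c_coeff;  // every thread reads the same address:
                                   // one constant-cache access, broadcast
}

__global__ void scale_global(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * g_coeff;  // every thread reads the same address:
                                   // serviced as an ordinary load via L1/L2
}
```

In both kernels every thread in the (half) warp requests the same 4-byte word, so I would expect the memory system to behave as I described above in both cases.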
Additionally, I just wanted confirmation that there are no profiler counters to measure constant memory requests or constant cache hits and misses?