After reading a bit about the differences between constant, texture, and global memory, I was curious whether it is possible to actually measure the cache misses a CUDA program incurs during debugging.
On current hardware, this is either not possible or not exposed as a performance counter in cudaprof. Here’s hoping that Fermi adds some good performance counters for its cache.