Method to measure cache misses? Access times etc?

After reading up on the differences between constant, texture, and global memory, I'm curious whether it is possible to measure the cache misses a CUDA program actually incurs, for example while debugging.

On current hardware, this is either not possible or not exposed as a perf counter to cudaprof. Here’s hoping that Fermi adds some good perf counters for its cache.
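
As for the access-time half of the question, that can at least be estimated by hand without any cache counters. Below is a rough sketch, not a definitive benchmark: a single thread walks a chain of dependent loads and reads the device cycle counter before and after. The kernel name `chase` and the parameters `n`, `stride`, and `iters` are made up for illustration, and the code assumes a device and toolkit that provide `clock64()` (on older devices the 32-bit `clock()` should play the same role).

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: one thread follows a chain of indices, so every load
// depends on the result of the previous one and the loads cannot overlap.
__global__ void chase(const unsigned int *next, int iters,
                      unsigned int *sink, long long *cycles)
{
    unsigned int idx = 0;
    long long start = clock64();          // per-multiprocessor cycle counter
    for (int i = 0; i < iters; ++i)
        idx = next[idx];                  // dependent load
    long long stop = clock64();
    *sink = idx;                          // keeps the loop from being optimized away
    *cycles = stop - start;
}

int main()
{
    // Illustrative values only; vary them to change the access pattern.
    const int n = 1 << 20;                // array length in elements (4 MiB)
    const int stride = 32;                // distance between consecutive accesses
    const int iters = 10000;              // number of timed loads

    // Each element stores the index of the next element to visit.
    std::vector<unsigned int> h_next(n);
    for (int i = 0; i < n; ++i)
        h_next[i] = (i + stride) % n;

    unsigned int *d_next = nullptr, *d_sink = nullptr;
    long long *d_cycles = nullptr;
    cudaMalloc(&d_next, n * sizeof(unsigned int));
    cudaMalloc(&d_sink, sizeof(unsigned int));
    cudaMalloc(&d_cycles, sizeof(long long));
    cudaMemcpy(d_next, h_next.data(), n * sizeof(unsigned int),
               cudaMemcpyHostToDevice);

    chase<<<1, 1>>>(d_next, iters, d_sink, d_cycles);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    long long cycles = 0;
    cudaMemcpy(&cycles, d_cycles, sizeof(cycles), cudaMemcpyDeviceToHost);
    printf("average cycles per dependent load: %.1f\n",
           (double)cycles / iters);

    cudaFree(d_next);
    cudaFree(d_sink);
    cudaFree(d_cycles);
    return 0;
}
```

Running this with different `n` and `stride` values and comparing the averages gives a crude picture of latency for different access patterns. It does not count misses; it only lets you infer them from jumps in the average, and the number includes loop overhead, so treat it as an estimate. The same idea should carry over to constant or texture memory by changing how the array is read.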