Timing code on a Fermi...

While attempting to time my code on a Fermi device, I noticed that caching is skewing my numbers. I tried using .cv (cache as volatile) on the first load, but as far as I can tell that makes all subsequent loads from the same memory address volatile as well. I have an inner loop over which I compute an average time, so is there something I could put at the beginning of each iteration to invalidate all cache lines?
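To make the question concrete, here is a rough sketch of the kind of loop I mean, with one blunt workaround dropped in: before each timed launch, a second kernel streams through a scratch buffer that is assumed to be larger than the L2, so the lines of interest are likely evicted. The names timed_kernel and evict_cache, the 4 MiB scratch size, and the launch dimensions are all made up for illustration; this is not a true cache invalidate, and it may not clear the L1 on every SM.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel under test: reads in[], writes out[].
__global__ void timed_kernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// Hypothetical eviction kernel: touches every element of a scratch buffer
// (assumed larger than the L2) so previously cached lines are likely replaced.
__global__ void evict_cache(float *scratch, size_t n) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    size_t first  = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    for (size_t i = first; i < n; i += stride)
        scratch[i] += 1.0f;
}

int main() {
    const int    n         = 1 << 16;
    const size_t scratch_n = (4u << 20) / sizeof(float); // 4 MiB, assumed > L2 size
    const int    iters     = 10;

    float *in, *out, *scratch;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMalloc(&scratch, scratch_n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));
    cudaMemset(scratch, 0, scratch_n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    float total_ms = 0.0f;
    for (int it = 0; it < iters; ++it) {
        // Evict outside the timed region; same stream, so it finishes first.
        evict_cache<<<64, 256>>>(scratch, scratch_n);

        cudaEventRecord(start);
        timed_kernel<<<(n + 255) / 256, 256>>>(in, out, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        total_ms += ms;
    }
    printf("average time: %f ms over %d iterations\n", total_ms / iters, iters);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    cudaFree(scratch);
    return 0;
}
```

Because the eviction launch sits before the start event, its cost stays out of the average. What I am really after is something cheaper and more direct than this.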

I know it's kind of an odd question, thanks for reading though!
~Clamport