How to measure the clock cycles of a global memory access?

Hi,

According to the CUDA Programming Guide, a global memory access costs 400-600 clock cycles.
I want to measure that number of clock cycles myself.
Unfortunately, my code doesn't seem to measure it correctly…

Could anyone please show me code that measures the latency of a global memory access in clock cycles?

Thanks

http://www.stuffedcow.net/research/cudabmk
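
For a quick start, here is a minimal sketch of the pointer-chasing approach that latency microbenchmarks commonly use. It is not code taken from the suite linked above. A single thread follows a chain of indices, so each load has to wait for the previous one, and clock64() is read before and after the loop. The kernel name chase, the array size, the stride and the iteration count are all my own assumptions for illustration. clock64() needs compute capability 2.0 or later. The result includes loop overhead and possibly TLB misses, so treat it as a rough estimate.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: one thread walks a chain of indices in global memory.
// Each load address depends on the previous load, so the loads cannot overlap.
__global__ void chase(const unsigned int *chain, int iters,
                      unsigned int *sink, long long *cycles)
{
    unsigned int j = 0;
    long long start = clock64();
    for (int i = 0; i < iters; ++i)
        j = chain[j];                 // dependent load
    long long stop = clock64();
    // The final load may still be in flight when stop is read; with many
    // iterations that error is small.
    *sink = j;                        // keeps the loads from being optimized away
    *cycles = stop - start;
}

int main()
{
    // Assumed sizes: a 64 MiB array walked with a 128 KiB stride, so that
    // every access in the timed loop touches a line not visited before.
    const unsigned int n      = 1u << 24;   // number of 4-byte elements
    const unsigned int stride = 1u << 15;   // elements between accesses
    const int iters = n / stride;           // 512 loads, one pass over the chain

    std::vector<unsigned int> h(n);
    for (unsigned int i = 0; i < n; ++i)
        h[i] = (i + stride) % n;            // element i points stride ahead

    unsigned int *d_chain = nullptr, *d_sink = nullptr;
    long long *d_cycles = nullptr;
    cudaMalloc(&d_chain, n * sizeof(unsigned int));
    cudaMalloc(&d_sink, sizeof(unsigned int));
    cudaMalloc(&d_cycles, sizeof(long long));
    cudaMemcpy(d_chain, h.data(), n * sizeof(unsigned int),
               cudaMemcpyHostToDevice);

    chase<<<1, 1>>>(d_chain, iters, d_sink, d_cycles);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    long long cycles = 0;
    cudaMemcpy(&cycles, d_cycles, sizeof(long long), cudaMemcpyDeviceToHost);
    printf("about %.1f cycles per load (loop overhead included)\n",
           (double)cycles / iters);

    cudaFree(d_chain);
    cudaFree(d_sink);
    cudaFree(d_cycles);
    return 0;
}
```

Build it with nvcc and run it a few times. If the number looks far too low, the accesses are probably hitting a cache, or the compiler has moved the clock reads; checking the generated code with cuobjdump -sass should show which.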