Delay loops/wait loops in CUDA

For many applications we need to generate a clock of a certain frequency. Apart from dedicating a thread to counting and toggling a value, is there a better or more efficient way to generate a clock?
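To make the question concrete, here is a minimal sketch of the dedicated-thread approach. It assumes `clock64()` (the per-multiprocessor cycle counter) as the time base, so the period is measured in SM cycles rather than wall-clock time. The names `spin_cycles`, `soft_clock`, and `run_soft_clock` are hypothetical and only for illustration.

```cuda
#include <cuda_runtime.h>

// Busy-wait until at least `cycles` ticks of the per-SM counter have elapsed.
__device__ void spin_cycles(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) {
        // spin: this thread does no useful work while waiting
    }
}

// A single dedicated thread toggles *level once every `half_period` cycles.
// It also records the total number of cycles spent, for inspection on the host.
__global__ void soft_clock(int *level, long long half_period,
                           int n_toggles, long long *elapsed) {
    volatile int *out = level;  // volatile so each toggle is actually written
    long long t0 = clock64();
    for (int i = 0; i < n_toggles; ++i) {
        spin_cycles(half_period);
        *out = *out ^ 1;
    }
    *elapsed = clock64() - t0;
}

// Host helper (hypothetical): runs the kernel with one thread and returns the
// final level (0 or 1), or -1 on a CUDA error. Elapsed cycles go to *elapsed_cycles.
int run_soft_clock(long long half_period, int n_toggles,
                   long long *elapsed_cycles) {
    int *d_level = nullptr;
    long long *d_elapsed = nullptr;
    int level = 0;

    if (cudaMalloc(&d_level, sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_elapsed, sizeof(long long)) != cudaSuccess) {
        cudaFree(d_level);
        return -1;
    }
    cudaMemcpy(d_level, &level, sizeof(int), cudaMemcpyHostToDevice);

    soft_clock<<<1, 1>>>(d_level, half_period, n_toggles, d_elapsed);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(&level, d_level, sizeof(int),
                                 cudaMemcpyDeviceToHost);
    if (err == cudaSuccess) {
        err = cudaMemcpy(elapsed_cycles, d_elapsed, sizeof(long long),
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(d_level);
    cudaFree(d_elapsed);
    return err == cudaSuccess ? level : -1;
}
```

Calling `run_soft_clock(1000, 8, &elapsed)` from host code toggles the value eight times with at least 1000 cycles between toggles. The drawbacks are the ones behind the question: the thread is fully occupied while spinning, and since the SM clock rate can vary, a cycle count does not map to a fixed frequency.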