If you want to time a kernel accurately, use CUDA events. For example, look at the simpleStreams sample in the SDK to see how events are used for timing; the event API is described in the Programming Guide. Note that events are recorded on the GPU, so you'll be timing GPU execution only. The nice benefit is that the clock resolution is the period of the GPU shader clock, so you should get reliable timings even from a single kernel launch.
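A minimal sketch of event-based timing (the `scale` kernel and its launch configuration are made up just to have something to time):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, only here so there is something to time.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Events are recorded (timestamped) on the GPU, in stream order.
    cudaEventRecord(start, 0);
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
    cudaEventRecord(stop, 0);

    // Wait for the stop event, then read the elapsed GPU time in ms.
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

Because both events are timestamped on the GPU itself, the measurement excludes launch and driver overhead on the host side.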
If you want to time operations including CPU involvement (such as driver overhead), use your favorite CPU timer; just make sure you understand its resolution. Also, as seibert pointed out, call cudaThreadSynchronize() before starting the timer and again before stopping it, since kernel launches are asynchronous with respect to the host.
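A minimal sketch of CPU-side timing under those rules. The `busy` kernel is hypothetical, and `gettimeofday` stands in for whatever high-resolution timer you prefer:

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <sys/time.h>  // gettimeofday; substitute your preferred timer

// Hypothetical kernel, only here so there is something to time.
__global__ void busy() { }

static double wall_ms() {
    struct timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

int main() {
    // Drain any pending GPU work so the timer doesn't count earlier launches.
    cudaThreadSynchronize();
    double t0 = wall_ms();

    busy<<<1, 1>>>();

    // Wait for the kernel to finish before reading the timer again;
    // without this you only measure the asynchronous launch overhead.
    cudaThreadSynchronize();
    double t1 = wall_ms();

    printf("CPU-side time: %f ms (includes driver overhead)\n", t1 - t0);
    return 0;
}
```

This measures wall-clock time as seen by the host, so it includes launch and driver overhead in addition to the GPU execution itself.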
Never use blocking CUDA calls (such as memcopies) to achieve synchronization; that will badly skew your timings.