Boilerplate code for timing a GPU kernel

I am trying to write some timing boilerplate code to compare a tree-based (reduction) sum against a simple serial loop sum on my GPU. Can anyone point me to boilerplate code that times how long it takes to launch a GPU kernel and return its results?

Use CUDA events: record one event before and one after the kernel launch with cudaEventRecord, wait on the stop event with cudaEventSynchronize, then get the elapsed time with cudaEventElapsedTime. A minimal sketch is below.
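
A minimal sketch, assuming a placeholder kernel myKernel and an arbitrary launch configuration; swap in your tree-sum or serial-sum kernel and its grid/block sizes:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder kernel -- replace with your tree sum or serial loop sum.
    __global__ void myKernel(const float* in, float* out, int n)
    {
        // ... kernel body ...
    }

    int main()
    {
        const int n = 1 << 20;
        float *d_in = nullptr, *d_out = nullptr;
        cudaMalloc(&d_in,  n * sizeof(float));
        cudaMalloc(&d_out, sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);                  // record on the default stream
        myKernel<<<256, 256>>>(d_in, d_out, n);  // kernel being timed (example launch config)
        cudaEventRecord(stop);

        cudaEventSynchronize(stop);              // block until the kernel has finished
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
        printf("kernel time: %f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }

Because the events are recorded on the same stream as the kernel, the measured interval covers the kernel execution itself rather than just the (asynchronous) launch call on the host.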

http://docs.nvidia.com/cuda/cuda-c-programming-guide/#events