I would need a specific example, but there is also initialization time to factor in. We did have an issue in early releases where the total time would be incorrect when a data region was used. There was also a problem when using CUDA 4.0 just after it was first released. Both of these issues have since been resolved.
Accelerator Kernel Timing data
c2.c
main
32: region entered 1 times
time(us): total=1182682 init=1180869 region=1813
kernels=170 data=1643
w/o init: total=1813 max=1813 min=1813 avg=1813
34: kernel launched 1 times
time(us): total=170 max=170 min=170 avg=170
Here is what I am talking about. As you can see in this example, kernel time + data time = region time (170 + 1643 = 1813 us). This holds until you get to the later examples in many of the whitepapers; then it no longer holds.
Initialization time has nothing to do with it. Why does this hold sometimes and not at other times?
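For anyone who wants to check this on their own output, here is a minimal Python sketch that pulls the counters out of a region block and reports how much region time is not covered by kernels + data. It assumes the timing output looks like the block posted above. The helper names `parse_region` and `unaccounted_us` are made up for illustration and are not part of any PGI tool.

```python
import re

# Region block copied from the profile posted above.
PROFILE = """\
32: region entered 1 times
time(us): total=1182682 init=1180869 region=1813
kernels=170 data=1643
w/o init: total=1813 max=1813 min=1813 avg=1813
"""


def parse_region(text):
    """Collect name=value pairs from a region block.

    The first occurrence of a name wins, so the "total" on the
    time(us) line is kept and the later "w/o init" total is ignored.
    """
    fields = {}
    for name, value in re.findall(r"(\w+)=(\d+)", text):
        fields.setdefault(name, int(value))
    return fields


def unaccounted_us(fields):
    """Region time (us) not covered by the kernels and data counters."""
    return fields["region"] - (fields["kernels"] + fields["data"])


fields = parse_region(PROFILE)
# Both differences are 0 for the profile above.
print("init + region - total =", fields["init"] + fields["region"] - fields["total"])
print("unaccounted region time (us) =", unaccounted_us(fields))
```

A nonzero second value on one of the later whitepaper examples would show exactly how much region time the kernels and data counters are missing, which makes the discrepancy easier to report.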