Timing Comparisons

I would like some guidance on how to do timing comparisons properly. I want to benchmark several methods, each implemented as a kernel, and I am not sure whether the memcpys afterwards need to be part of the timing comparison. A kernel runs in a really short time, but with large variability: in one run it takes 288 ns, in another run the same kernel takes 312 ns. That is a spread of roughly 8%, so my only choice is to run it many times, e.g. 1024 times, study the distribution/histogram, and compute the mean and standard deviation.
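A minimal sketch of that kind of summary, assuming the per-run times have already been collected in nanoseconds into a std::vector<double>. The names TimingSummary and summarize are hypothetical and only for illustration; how the samples are measured is left open here.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical summary type; the names are illustrative only.
struct TimingSummary {
    double mean_ns;    // arithmetic mean of the samples
    double stddev_ns;  // sample standard deviation (divides by n - 1)
};

// Summarize per-run timings given in nanoseconds.
// With fewer than two samples the standard deviation is reported as 0.
TimingSummary summarize(const std::vector<double>& samples_ns) {
    const std::size_t n = samples_ns.size();
    if (n == 0) {
        return {0.0, 0.0};
    }

    double sum = 0.0;
    for (double s : samples_ns) {
        sum += s;
    }
    const double mean = sum / static_cast<double>(n);

    if (n < 2) {
        return {mean, 0.0};
    }

    // Accumulate squared deviations from the mean.
    double sq_dev = 0.0;
    for (double s : samples_ns) {
        const double d = s - mean;
        sq_dev += d * d;
    }
    const double stddev = std::sqrt(sq_dev / static_cast<double>(n - 1));

    return {mean, stddev};
}
```

The same function applies whether each sample covers the kernel alone or the kernel plus the memcpy, so both variants can be summarized and compared side by side.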
But as this is on the GPU, I think the memcpy time must be included, because there would be no point in computing anything on the GPU if nothing were returned to the host. Or would there? After all, the point is a compute timing comparison: we want to benchmark the kernel computation times, and a kernel doesn’t know anything about memcpys.
Can you convince me one way or the other, whether or not the memcpy time should be included in the timing comparison?
Do I sound like a drama queen over a non-issue?