I don’t think we are in disagreement. But what exactly “end-to-end” means depends very much on the use case. Let us take a matrix multiply as an example. Assume the kernel execution for a matrix of a given size takes 5 ms, and copying the data either host->device or device->host takes 2 ms. The same-size matrix multiply on the CPU takes 100 ms. What speedup should be reported for the matrix multiply?

Use case 1: The matrix multiply is called once, so there is no opportunity to keep data resident on the GPU, nor to overlap copies with kernel execution. The cost of the matrix multiply end-to-end is therefore 2+5+2 = 9 ms, so the end-to-end speedup is 100/9 = 11.1x.
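A minimal sketch of the arithmetic, using the assumed round-number timings from above (they are illustrative, not measurements):

```python
# Timings from the example (milliseconds)
COPY_MS = 2.0    # one host<->device transfer
KERNEL_MS = 5.0  # one GPU matrix multiply
CPU_MS = 100.0   # one CPU matrix multiply

# Use case 1: a single call, so both transfers sit on the critical path.
gpu_total = COPY_MS + KERNEL_MS + COPY_MS  # 9 ms
speedup = CPU_MS / gpu_total               # ~11.1x
print(f"{gpu_total:.0f} ms end-to-end, {speedup:.1f}x speedup")
```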

Use case 2: The matrix multiply is called five times, and by using async streams the copies are overlapped with kernel execution: the device->host copy of the previous kernel's result overlaps with the execution of the current kernel, as does the host->device copy for the succeeding kernel. The overlap isn't 100% perfect (there are always a few friction losses), so the five matrix multiplies execute, end-to-end, in 2+26+2 = 30 ms, and the end-to-end speedup versus the CPU is 500/30 = 16.7x.
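The same sketch for the overlapped case. Only the first upload and the last download are exposed; the five back-to-back kernels would ideally take 25 ms, and I am assuming 1 ms of friction losses to match the 26 ms figure above:

```python
COPY_MS, KERNEL_MS, CPU_MS = 2.0, 5.0, 100.0
N = 5
friction_ms = 1.0  # assumed overhead from imperfect overlap (25 -> 26 ms)

# Intermediate copies hide under kernel execution; only the first
# host->device and the last device->host transfer remain visible.
gpu_total = COPY_MS + (N * KERNEL_MS + friction_ms) + COPY_MS  # 30 ms
speedup = (N * CPU_MS) / gpu_total                             # ~16.7x
print(f"{gpu_total:.0f} ms end-to-end, {speedup:.1f}x speedup")
```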

Use case 3: The matrix multiply is called 50 times, and the data consumed and produced stays entirely resident on the GPU during the computation (e.g. a naive matrix exponentiation), with data copied between host and device only at the start and end of the sequence. Total time is 2+50*5+2 = 254 ms, resulting in an end-to-end speedup of 5000/254 = 19.7x.
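And the resident-data case, where a single upload and a single download bracket the whole kernel sequence. Note that as the number of calls grows, the speedup approaches the pure kernel ratio of 100/5 = 20x:

```python
COPY_MS, KERNEL_MS, CPU_MS = 2.0, 5.0, 100.0
N = 50
# Data stays on the GPU for the whole sequence; transfers happen only
# once at the start and once at the end.
gpu_total = COPY_MS + N * KERNEL_MS + COPY_MS  # 254 ms
speedup = (N * CPU_MS) / gpu_total             # ~19.7x
print(f"{gpu_total:.0f} ms end-to-end, {speedup:.1f}x speedup")
```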