How to report speed-up from the GPU vs. CPU?

Arakageeta · January 7, 2011, 6:07pm

In the published research, do researchers include the data-staging phases of their computation into execution time when making comparison to CPU-only implementations?

For example, suppose I want to multiply two matrices A and B. My phases of execution may look like this: 1) Send A to the GPU. 2) Send B to the GPU. 3) Execute kernel. 4) Read back result. Would a researcher compare the summed execution time of steps 1 through 4 against a CPU-only solution, or would they compare only the execution time of step 3 against a CPU-only solution?

It seems to me that the most honest method would be to include all of steps 1 through 4, but I wanted to post here to see whether researchers commonly “game” their results. What do you think?

njuffa · January 7, 2011, 6:32pm

Wouldn’t it depend on the use case? Consider the following cases:

(1) A lengthy computation involving many kernel invocations in which all data is kept resident on the GPU. Data is copied between CPU and GPU only at the start and end of the entire computation, which may last many seconds or even minutes.
(2) Computation on the GPU is intermixed with CPU computation plus GPU<->CPU transfers in such a way that host/device copies run concurrently with kernel execution, such that there is almost perfect overlap between kernel execution and copies. This is enabled by CUDA’s streams.
(3) Computation in which data exchange between device and host cannot be (sufficiently) overlapped with kernel execution on the GPU, e.g. due to data dependencies.

In moving computation from CPU to GPU, (1) would be the ideal case one would strive for, but it may not be achievable when only parts of a complex computation are being moved over to the GPU. In that case, one would next strive to implement scenario (2), which is frequently achievable. In either scenario (1) or (2), there is no or negligible copy overhead, i.e. the GPU vs. CPU speedup is dominated by the kernel execution time vs. the CPU execution time. Only in the third scenario would copy time impact the computational speedup.
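To make the three scenarios concrete, here is a rough timing model in Python. It is only a sketch: the function names and the numbers are made up for illustration, and the overlapped case assumes an idealized pipeline with a single copy engine, where each steady-state step costs the larger of the kernel time and the combined copy time.

```python
def resident_time(n, h2d, kernel, d2h):
    """Scenario (1): data stays on the GPU; one copy in, n kernels, one copy out."""
    return h2d + n * kernel + d2h


def overlapped_time(n, h2d, kernel, d2h):
    """Scenario (2), idealized: copies for neighboring calls hide behind kernels.

    Assumption (not from the thread): a single copy engine, so each step
    costs max(kernel, h2d + d2h). Only the first copy-in and the last
    copy-out remain fully exposed.
    """
    return h2d + n * max(kernel, h2d + d2h) + d2h


def serialized_time(n, h2d, kernel, d2h):
    """Scenario (3): no overlap; every call pays copy-in + kernel + copy-out."""
    return n * (h2d + kernel + d2h)


if __name__ == "__main__":
    # Hypothetical timings in milliseconds, chosen only for illustration.
    n, h2d, kernel, d2h = 10, 1, 4, 1
    for name, fn in [("resident", resident_time),
                     ("overlapped", overlapped_time),
                     ("serialized", serialized_time)]:
        print(f"{name:>10}: {fn(n, h2d, kernel, d2h)} ms")
```

With these made-up numbers the resident and overlapped cases both come out kernel-dominated, while the serialized case pays the full copy cost on every call. If the kernel were shorter than the copies, the overlapped case would become copy-bound instead.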

Arakageeta · January 7, 2011, 8:15pm

If it’s negligible, then there’s no harm in including it, right? To make a fair comparison between a GPU and CPU-only implementation, I feel you need to consider the entire end-to-end computation. Otherwise you’re comparing apples to oranges. I just want to verify that this is how it’s usually done. My feeling is that most researchers do consider the end-to-end computation, but I just wanted to check what the community felt. I mean, it’s a bit unfair to compare radix sort on a huge amount of data if you don’t also include the data transmission time. Consideration of the end-to-end computation also helps identify data-bound problems. It also allows us to make fair comparisons between discrete GPUs and integrated GPUs.

I guess it all depends on what you want to measure, but consider the sample eigenvalue program that is part of the CUDA SDK. The timer only measures kernel execution time. But what good is the speed of a computation if you haven’t yet read back the results? The timer is halted before processResultDataLargeMatrix() is called, which reads back the results from the GPU. The eigenvalue program doesn’t try to assert a speedup over a CPU implementation, but why would I care about kernel execution time when what I really care about is how long it takes for me to get my results?

eelsen · January 7, 2011, 9:18pm

Njuffa’s point was that if you end up using the result of the radix sort or the eigenvalues ON THE GPU, then timing the transfer back to the host doesn’t make sense. Furthermore, if the input to the radix sort was generated on the GPU, then you don’t need to time the transfer to the device either. Since most real-world usage scenarios are complicated, with some people needing to upload/download results and others not, I think just publishing both sets of data makes the most sense…

njuffa · January 7, 2011, 9:30pm

I don’t think we are in disagreement. But what exactly end-to-end means will depend very much on the use case. Let us take a matrix multiply as an example. Let us further assume the kernel execution for a given matrix size takes 5 ms, and copying the data either host->device or device->host takes 2 ms. The same-size matrix multiply on the CPU takes 100 ms. What speedup should be reported for the matrix multiply?

Use case 1: The matrix multiply is called once, so there is no chance of keeping data resident on the GPU, nor of overlapping copies and kernel. That is, the cost of the matrix multiply end-to-end is 2+5+2=9 ms, thus end-to-end speedup is 100/9 = 11.1x

Use case 2: The matrix multiply is called five times, and by using async streams, copies are overlapped with kernel execution, such that the device->host copy for the result of the previous kernel overlaps with the execution of the current kernel, as does the host->device copy for the succeeding kernel. Maybe the overlap isn’t 100% perfect (there are always a few friction losses), and the five matrix multiplies execute, end-to-end, in 2+26+2 = 30 ms, so the end-to-end speedup versus the CPU is 500/30 = 16.7x

Use case 3: The matrix multiply is called 50 times, and the data consumed and produced is entirely resident in the GPU during the computation (e.g. we are doing a naive matrix exponentiation), and data is copied between host and device only at the start and end of the sequence, so total time is 2+50*5+2 = 254 ms, resulting in an end-to-end speedup of 5000/254 = 19.7x
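The arithmetic for the three use cases can be checked with a few lines of Python. This is just a sketch using the assumed timings from this post (2 ms per copy, 5 ms per kernel, 100 ms per CPU multiply); the variable names are made up, and the 30 ms total for use case 2 is the estimate given above rather than something derived. The kernel-only figure is added for comparison.

```python
H2D_MS = 2.0      # host->device copy (assumed above)
KERNEL_MS = 5.0   # one GPU matrix multiply (assumed above)
D2H_MS = 2.0      # device->host copy (assumed above)
CPU_MS = 100.0    # one CPU matrix multiply (assumed above)


def speedup(cpu_ms, gpu_ms):
    """Ratio of CPU time to GPU time for the same amount of work."""
    return cpu_ms / gpu_ms


# (total CPU time, total GPU end-to-end time) for each use case
cases = {
    "use case 1 (1 call, no overlap)": (1 * CPU_MS, H2D_MS + KERNEL_MS + D2H_MS),
    "use case 2 (5 calls, overlapped)": (5 * CPU_MS, 30.0),  # 30 ms estimated above
    "use case 3 (50 calls, resident)": (50 * CPU_MS, H2D_MS + 50 * KERNEL_MS + D2H_MS),
}

results = {name: speedup(cpu, gpu) for name, (cpu, gpu) in cases.items()}
kernel_only = speedup(CPU_MS, KERNEL_MS)

if __name__ == "__main__":
    for name, s in results.items():
        print(f"{name}: {s:.1f}x")
    print(f"kernel only: {kernel_only:.1f}x")
```

The end-to-end numbers approach the kernel-only ratio of 100/5 = 20x as the copies are amortized over more kernel invocations.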

Arakageeta · January 8, 2011, 4:14am

I agree.

Let me rephrase my question:

The CUDA Showcase has advertised speed-ups beside many of the published papers. Is it safe to assume, more often than not, that these are end-to-end speed-ups? I think this question could be answered by any of the following: yes; no; I don’t know; I think so; probably not; etc.