Is CUDA really that fast?

Good afternoon,
I am currently developing with Matlab and CUDA, and am not getting the expected results. I have observed that many of the CUDA SDK examples claim the calculations have been accelerated 10 times or more. But when you really time the whole execution, you find that the actual improvement is MUCH less (x1.04), because they only time kernel execution and not the whole process: they don't count the time spent on memory allocation and on memory transfers to the GPU, simply excusing it as “warming up the GPU”.
My total time improved from 400 seconds to 140. When looking at the profiler, I observe that the actual kernel execution time is only 17% of this, while the rest is spent on cudaMemcpy: http://forums.nvidia.com/index.php?showtopic=104504&st=0&p=577725&#entry577725
So my question is: when people state that calculation time has improved by x40 (just to say a number), do they refer to:
“Before it took 400 seconds, and now the kernel time alone is 17% of 140 ≈ 24 seconds”, or do they really mean that if it took them 400 seconds before, now it only takes 10 seconds?
Please, any honest reply would be greatly appreciated.
Thanks in advance, David

It is unfortunately impossible to make a true statement about what “people” state in general. Sometimes they mean one thing and sometimes they mean the other; each case must be evaluated in context.

A speedup number by itself is meaningless unless you explain what you’re measuring. This isn’t a CUDA or CPU issue, it’s just science. Always define your terms.

It also depends greatly on what the machine is and what card(s) you are using for CUDA…

For instance, if you were doing heavy work on a 1.8 GHz Celeron with two GTX 285s, you are capable of showing a LOT more performance increase than, say, a person running an i7 @ 4 GHz with a single 9400 GT.

It depends. Some people count only kernel times (sometimes even only selected kernel times), others are honest and give the whole program time, or anything in between. I usually measure execution time including memcpys but excluding one-time initialization (on the grounds that it's done once and usually takes less than a second anyway). Whether I include I/O depends on how often it would be done in a real use-case (e.g. once after the final results arrive, or between every other kernel launch).

Example timings from my last app look like this:

problem size: 100

2-threads CPU time: 1465 ms

CUDA time: 12 ms

I/O: 255 ms

problem size: 1000

2-threads CPU time: 15 665 ms

CUDA time: 89 ms

I/O: 253 ms

You can see why I’d do this - I/O (writing the generated 1kx1k bitmap to disk) takes more time than the computations. In this app, I/O is a constant overhead and is done once per runtime.

Now, how we treat I/O here is debatable, because for large problem sizes it's negligible yet for smaller ones it makes a huge difference.

For size 100, the speed-up with I/O is about 6x; without it, it's 122x.

For size 1000, it’s 46x vs 176x.

If I don't take the constant overhead into account, I get a more consistent speed-up that is more representative of how well this scales - that's why it does make sense to give this apparently dishonest number. But one should have a certain use-case in mind and remember that memcopies also count towards scalability. Knowing Amdahl's law helps too. Personally, if I were to publish those results or something, I'd supply all three numbers and let the reader decide how to interpret them and what to do with the I/O overhead.

So the bottom line is - some problems do get real-life 50-100x speed-ups, some don’t but are manipulated into giving such figures.

(“CUDA time” was kernel time + cudaMalloc time + cudaMemcpy time. There was no warm-up kernel or other trickery)
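In code, that measurement looks roughly like this (just a sketch - the kernel is a placeholder and the sizes are made up; context creation happens on the first CUDA call, before this function, and is not timed):

// Sketch of the "CUDA time" measurement: malloc + memcpys + kernel, no warm-up tricks.
#include <cuda_runtime.h>

__global__ void placeholderKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;        // stand-in for the real computation
}

float cudaTimeMs(const float *h_in, float *h_out, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);

    float *d_data;
    cudaMalloc((void**)&d_data, n * sizeof(float));                        // counted
    cudaMemcpy(d_data, h_in, n * sizeof(float), cudaMemcpyHostToDevice);   // counted
    placeholderKernel<<<(n + 255) / 256, 256>>>(d_data, n);                // counted
    cudaMemcpy(h_out, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);  // counted

    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaFree(d_data);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}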

If I were to quote a performance figure, it'd be whole program execution, including any memory transfers; however, I don't usually find it makes much of a difference. Both of the CUDA applications I've worked on have had under 10% of the total GPU runtime spent copying memory. My pattern tends to involve hundreds of kernel calls in between copies.

I guess it'd be quite different if you're doing a single operation in Matlab though - probably only one kernel call.

It's an interesting problem to think about. Imagine for a moment that you had a GPU that was not just fast, but infinitely fast. We'll say it can perform the entire program in a single clock cycle. Would this be infinitely faster, or even faster at all than the CPU implementation? Not necessarily. Even the infinitely fast GPU still needs to have data copied from the CPU and back, which may or may not be a significant time drain.

Let’s take the simple case of vector addition. On the CPU side, we simply read the values from memory, add them, and write them back. The memory is the bottleneck already - the CPU can do several additions in the time it takes to load two components to add once! Now let’s look at the GPU side. The GPU has much faster memory than the CPU, and vector addition is trivially parallelizable. Yet oddly enough, the total time is worse. Why? Simple - the PCIe bus is even slower than the CPU memory, so while the GPU can blaze through the actual computation, the cost of the PCIe transfer completely buries the gain.

Does that mean that the CPU is faster than the GPU for vector addition? Not necessarily. Suppose the data was already there on the GPU. Then the GPU would be much faster than the CPU, which would now be the one having to deal with a data transfer! But wait, the data probably wasn’t in any memory space to begin with - in fact, it was probably on the hard drive initially, and had to be read from there! Is it fair to put the hard drive read time into the comparison? What about the time spent to write it back out? The time the printer takes to print the results so you can read them?

The ultimate point is that we need to do all the work on the GPU - doing bits and pieces on the GPU and other bits and pieces on the CPU is generally horribly inefficient. Transferring data back and forth needs to be viewed like hard drive transfers - avoided if at all possible.
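To put the vector addition example above into code, here's a minimal sketch (names and sizes are purely illustrative) that times the transfers and the kernel separately - on the PCIe bus the two copies typically dwarf the addition itself:

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Times the host-to-device copies, the kernel and the device-to-host copy separately.
void addOnGpu(const float *h_a, const float *h_b, float *h_c, int n)
{
    size_t bytes = n * sizeof(float);
    float *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, bytes);
    cudaMalloc((void**)&d_b, bytes);
    cudaMalloc((void**)&d_c, bytes);

    cudaEvent_t e0, e1, e2, e3;
    cudaEventCreate(&e0); cudaEventCreate(&e1);
    cudaEventCreate(&e2); cudaEventCreate(&e3);

    cudaEventRecord(e0);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(e1);
    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
    cudaEventRecord(e2);
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(e3);
    cudaEventSynchronize(e3);

    float htod, kern, dtoh;
    cudaEventElapsedTime(&htod, e0, e1);
    cudaEventElapsedTime(&kern, e1, e2);
    cudaEventElapsedTime(&dtoh, e2, e3);
    printf("HtoD: %.3f ms  kernel: %.3f ms  DtoH: %.3f ms\n", htod, kern, dtoh);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
}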

Thanks for answering. I have converted a series of functions from Matlab to CUDA, and the time I am measuring is:

  • Time that Matlab takes to execute those functions vs. how long the CUDA version takes to execute them.

I use the Matlab profiler to measure total execution times, and the CUDA Visual Profiler to see how the CUDA file is generally behaving.

When I refer to speedup, I mean total times. Therefore, if Matlab takes 600 seconds and the call to the CUDA mex-file takes 300, I have a x2 speedup. Simple as that. But this includes memory allocation, memory transfer to and from the GPU, execution, and freeing memory.

Thanks in advance, David

Good morning, the calculations are performed on a Pentium D 3.4 GHz with one GTX 285. But the PCIe is only version 1, hence the bottleneck with cudaMemcpy.

Thanks in advance, David

I just can't help getting the feeling that it's cheating. You can't tell someone “the calculations are 50 times faster” (because you only counted kernel execution times), and then, once the program is tested in a real-life scenario, it turns out the end result is only 5% better. It just feels dishonest. When I time the total amount of time used, I use Matlab's profiler, so that I can time the WHOLE execution. This way I get an honest result. Thanks, David

I know exactly what you mean. When I first started here, I observed that the difference between cudaMemcpy(hostToDevice) + 1 kernel execution + cudaMemcpy(deviceToHost) and cudaMemcpy(hostToDevice) + 1000 kernel executions + cudaMemcpy(deviceToHost) was minimal, so there was a great improvement.

Unfortunately, I am currently converting certain hand-made mex files to CUDA, which are each invoked many times, but always with different data. So I can't get the desired improvement.

Thanks in advance, David

Unfortunately I call the CUDA-enabled mex file many times, but always with different data, so I can't avoid transferring data to and from the GPU.

I need to look at the possibility of changing the whole base problem, to make it better suited to CUDA's parallelism.

Thanks anyway, and nice post, David Lisin

One thing you may be able to do (it depends on the problem) is concurrent copy and execution using streams (Programming Guide 4.5.2.4). This could greatly accelerate your program.
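Very roughly, and only if the work can be split into independent chunks, that pattern looks something like this (a sketch with made-up names; the host buffers would need to be pinned, i.e. allocated with cudaMallocHost, for the async copies to actually overlap):

// Sketch of concurrent copy and kernel execution with two streams.
// "chunkKernel", the buffer names and chunk sizes are all hypothetical.
#include <cuda_runtime.h>

__global__ void chunkKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];          // placeholder for the real per-chunk work
}

void processChunks(const float *h_in, float *h_out, float *d_in, float *d_out,
                   int nChunks, int chunkElems)
{
    const int nStreams = 2;
    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s)
        cudaStreamCreate(&streams[s]);

    size_t chunkBytes = chunkElems * sizeof(float);
    for (int c = 0; c < nChunks; ++c) {
        cudaStream_t st = streams[c % nStreams];
        int off = c * chunkElems;
        // while one stream is copying, the other can be computing
        cudaMemcpyAsync(d_in + off, h_in + off, chunkBytes,
                        cudaMemcpyHostToDevice, st);
        chunkKernel<<<(chunkElems + 255) / 256, 256, 0, st>>>(d_in + off,
                                                              d_out + off, chunkElems);
        cudaMemcpyAsync(h_out + off, d_out + off, chunkBytes,
                        cudaMemcpyDeviceToHost, st);
    }
    cudaThreadSynchronize();   // wait for all streams (cudaDeviceSynchronize on newer toolkits)

    for (int s = 0; s < nStreams; ++s)
        cudaStreamDestroy(streams[s]);
}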

See, I do get a 150x real speed-up for problem sizes above, say, one million :) The equation is (CPUTime + constIO)/(CudaTime + constIO). For big problems, where both CudaTime and CPUTime are >> constIO, this reduces to CPUTime/CudaTime. That's why I said it depends on the use-case - will an end user be solving problems of size one million or one hundred?

Hi, thanks for the answer! The problem is that I am working with very large matrices (6000x65x65 double), and I generally work in only two of the three dimensions, which vary as follows (pseudo-code):

for i = 1:65
    [matrixResultA(:,i:end), matrixResultB(:,i)] = cudamexfilefunction(matrixA(:,i-end;i), matrixB(:,i,i-1), ....., vector(i:end));
    while abs(1 - matrixResultA()) > 0.0001
        do_stuff
    end
end

I can't copy all of the function parameters to the CUDA file because of lack of memory (each matrix is roughly 208 MB), and every kernel needs the result of the previous one. Also, the while at the end makes me copy data back to the CPU to check for abs() > 0.001, so there is the problem. If I could copy everything to the GPU and perform the 65 calls to the kernel there, all would be fine and lovely, but currently I call the function 65 times, with 65 HtoD copies, kernel launches and DtoH copies, killing any kind of improvement.

Thanks in advance, David

Thanks Tigga, I'll look into it, but I don't think I'll have much luck due to memory issues, as I work with large matrices. But if I get any result, good or bad, I'll post back. Again, thanks to you and everyone who responds, David

Welcome to Amdahl’s Law. I don’t think it’s dishonest to quote the raw CPU vs GPU computation times, provided you make it obvious that you are doing so (although one could argue that comparing MATLAB to CUDA is deeply unfair to the CPU…). You just then say that your new bottleneck is the issue of host<->device transfers, and that you’re looking into minimising these.
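For reference, the law itself: if a fraction p of the runtime is sped up by a factor s, the overall speed-up is 1 / ((1 - p) + p/s). Plugging in your numbers, with about 83% of the 140 seconds sitting in cudaMemcpy, even an infinitely fast kernel would only bring those 140 seconds down to roughly 116 - the transfers are the part worth attacking.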

Since I don't really know the ins and outs of how the programs use it, again it depends on a lot of different things.

It's possible your CPU could be limiting your GPU's potential.

PCIe v1.0 is going to hold a GTX 285 back a bit, but not a lot.

A good rule of thumb: a well-rounded machine is far superior to a slow, dated machine with one high-end new part.