Is this speedup too good to be true?

Hello, i am working on comparing the run time for an algorithm running on a GeForce GTX 580 GPU compared to a sequential program running on a Core i7 CPU.

The test algorithm is solving the linear advection equation in one dimension using a Lax-Friedrichs scheme, so essentially what is described under “illustration” here: http://en.wikipedia.org/wiki/Lax–Friedrichs_method

The core of the algorithm is essentially this:
(N is the number of grid points in the discretization)

while (time < end_time):
___for i = 1 to N:
______
___time += dt

In the CUDA version the inner loop is replaced by a kernel call. The outer loop is not parallelized since it has to performed in order.
With a constant end_time the length of the outer loop increases with N, since dt must decrease linearly with dx for stability.

I tried comparing the sequential run time to the CUDA run time (speedup=sequential_time/cuda_time) while varying both system size N and length of simulation end_time. I got what is shown in the attached file.

I found several strange things about this:

  1. The speedup factor seems to increase with N indefinitely. I would think that when N rises above the effective “number of cores” of the GPU, the speedup stabilizes on a constant number since further added grid cells cannot be computed in parallel with everything else anyway, but must wait for its turn on the GPU. This is the behaviour i get when multi-core parallelizing it on the Core i7 at least: A pretty stable 3x speedup independent of N (unless N is very small).

The speedup factor seems, at large enough end_time, to decrease with increasing end_time when N is held constant. How can this be?

  1. The speedup factor seems in general very high. In this case it goes as high as 300, and from the plot it would seem that it would continue to increase if i tried higher N.

So:
Can anything explain why the speedup factor keeps increasing with N? Or why it starts decreasing with increasing end_time (when it is high enough)?
Can the speedup factor really be this high? Others have reported much more modest gains.

OR: Must i have messed up somewhere?

EDIT: I have compared the plotted outputs of the program run on the GPU and the CPU, and the results always look identical.
compare_both_adv.pdf (13.5 KB)

If you are seeing a speedup of 300x, there is probably a timing problem. Where are you putting your timing checks?

I am simply timing the entire program from the moment the binary is run to the time it returns. I’ve done this both with “time.time()-start_time” in a calling Python script, and simply a manual “time ./binfile” in the terminal and reading of the “real” (i.e. wall time) for comparison. In both cases i get a speedup in the hundreds if N and end_time is high enough.

I have also plotted the output immediately after the CUDA version returns, and the data looks fine. So it doesn’t seem like the program returns before the work is done or anything like that…

300X is indeed on the high end of expected speedup, but not unreasonable, since the code is simple and it’s likely to be ALU-bound.

I would try to rule out problems with optimizations in the CPU version. E.g. if the division dt/dx in the Wikipedia link is not optimized out, that will substantially hinder performance.

I’d also think about this in terms of floating point operations per second. Plot effective # of flops achieved in both cases. See if those numbers make sense. You should be getting on the order of 3G flops per core on the CPU and 800G flops on the GPU. The big unknown is the number of operations per kernel, which can vary a lot depending on the situation. For example, if a and dt/dx are know at compile time, the transformation in Wikipedia could be compiled into as little as three instructions (a multiplication, a fused multiply-add, and a store). If dx is a runtime argument, it will be much longer. You’d have to look at the assembly code to see what’s happening.

Both dx and dt are decided at runtime. But they are constant, so the number a*dt/dx is only computed once in the beginning and passed around as a constant.

But even if we can explain the 300x, how can we explain the fact that it seems to be ever increasing? I did a long run tonight with even higher N, and i got a speedup factor as high as 450x. When plotting runtime VS N for the CPU and the GPU in a loglog-plot, i see that both have the approximate N^2 dependence as expected, but it seems that the GPU runtime increases juust slightly less with N.

EDIT: It could appear i’ve managed to write such horrible CPU code that it scales worse than N^2 for this problem… I can’t see how though.

Oh and why do you think the code is ALU bound? I mean for every u_(i+1) etc in the formula there is a global memory access. I also get approximately the same performance for single and double precision. (Single is a little better, but not 2x as one would expect for computationally bound (?)). I always took this to mean that the bottleneck was memory access.

On a Geforce card, a single/double throughput ratio of 2 would indicate a memory bound problem. Compute bound problems would have a ration of 8 (for compute capability 2.0) or 12 (for 2.1).

Are you suggesting that both CPU and GPU version increase as N[sup]2[/sup], but their ratio increases linearly with N? How would that be possible?

Is N also the number of threads used on the GPU, or do you produce multiple outputs per thread?

I though the double-precision performance was half of the single-precision performance in Fermi cards (which mine is one of), and one-eight for older cards. Anyway, no i got a difference factor of less than 2 between single and double, but expected 2.

N is the system size, and thus the length of the inner for loop in the sequential code (which is replaced by a kernel call in the CUDA code). So yes i guess it is equal to the number of threads scheduled. For quite large N they’ll have to queue up though, and N is not equal to the number of concurrent threads. The reason i expect it to scale as N^2 is because the length of the outer time-loop also scales with N.

What i’m finding now is that the run-time of the GPU scales exactly as N^2 as expected. The sequential code scales very strangely however… For low N in behaves as expected (N^2), but suddenly at large enough N it starts scaling exactly as N^4…!

This is why the speedup factor blows up, but i have NO idea how the sequential code starts scaling as N^4…

EDIT: And the specific N where the change happens depends on end_time (i.e. how long the outer loop runs for)…

Because the scale of your problem is such that all your data is expected to fit inside L1/L2 caches. It will become memory bound if N>10^6.

The caches on the GPU are really that high?

Anyway, how come i get the same performance with double and single precision then if the computation is the bottleneck? (The same program compiled with -arch sm_20, and with no -arch option. The latter case gives the “demoting to float” warning. The former does not.)

No, but the CPU caches are. So you will see a large increase in runtime once the caches overflow on the CPU, which might somehow look like N^4 scaling if you don’t have enough data points for very large computations.

Which doesn’t explain the initial finding of unlikely high GPU vs. CPU speedup though. Coming back to seibert’s question: Where do you store the results of the program run, so you can plot them after the program finished?

But if the memory latency suddenly increases, the total runtime should still scale as N^2 shouldn’t it? I would expect a jump in the factor “a” in a*N^2, but not a change in the scaling power.

I would say this is very connected with the very high speedup factor of GPU vs CPU, since the GPU runtime scales as N^2 as far as i have checked. If the CPU runtime by som error scales as N^4 then of course the speedup factor becomes artificially high.

Where do i store the results? On the hard drive, if that’s what you’re asking… The writing to disk takes no time at all though, compared to the computational loops.