Hello, I am comparing the run time of an algorithm on a GeForce GTX 580 GPU with that of a sequential program on a Core i7 CPU.
The test algorithm is solving the linear advection equation in one dimension using a Lax-Friedrichs scheme, so essentially what is described under “illustration” here: http://en.wikipedia.org/wiki/Lax–Friedrichs_method
The core of the algorithm is essentially this:
(N is the number of grid points in the discretization)
while (time < end_time):
    for i = 1 to N:
        (update grid point i with the Lax-Friedrichs stencil)
    time += dt
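For concreteness, here is a minimal Python sketch of the sequential loop above. It is not my actual program: the periodic boundaries, the sine initial condition, the wave speed a, the CFL number cfl and the function names are all assumptions made for illustration.

```python
import math

def lax_friedrichs_step(u, a, dt, dx):
    """One Lax-Friedrichs step for u_t + a*u_x = 0 (periodic boundaries assumed)."""
    n = len(u)
    c = a * dt / (2.0 * dx)
    new_u = [0.0] * n
    for i in range(n):  # the loop over grid points (the part a kernel would replace)
        left = u[i - 1]          # index -1 wraps around to the last point
        right = u[(i + 1) % n]   # wrap around at the right edge
        new_u[i] = 0.5 * (left + right) - c * (right - left)
    return new_u

def solve(n, end_time, a=1.0, cfl=0.5, length=1.0):
    """Run the time loop; returns the final field and the number of steps taken."""
    dx = length / n
    dt = cfl * dx / abs(a)  # dt shrinks linearly with dx (assumed CFL condition)
    u = [math.sin(2.0 * math.pi * i * dx) for i in range(n)]  # assumed initial data
    time = 0.0
    steps = 0
    while time < end_time:  # outer loop: must run in order
        u = lax_friedrichs_step(u, a, dt, dx)
        time += dt
        steps += 1
    return u, steps
```

Each step needs the complete field from the previous step, which is why only the loop over i is a candidate for parallelization.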
In the CUDA version the inner loop is replaced by a kernel call. The outer loop is not parallelized, since its iterations have to be performed in order.
With a constant end_time, the number of outer-loop iterations increases with N, since dt must decrease linearly with dx for stability.
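To make that scaling concrete, here is a small sketch of how the number of time steps and the total number of grid-point updates grow with N. The domain length, the wave speed a and the CFL number cfl are illustrative assumptions, not values from my program.

```python
import math

def step_count(n, end_time, a=1.0, cfl=0.5, length=1.0):
    """Number of outer-loop iterations for n grid points (assumed CFL-limited dt)."""
    dx = length / n
    dt = cfl * dx / abs(a)
    return math.ceil(end_time / dt)

def total_updates(n, end_time):
    """Grid-point updates over the whole run: n points per step times the step count."""
    return n * step_count(n, end_time)

for n in (16, 32, 64):
    print(n, step_count(n, 1.0), total_updates(n, 1.0))
```

Under these assumptions, doubling N doubles the number of steps and quadruples the total work, while the work per kernel call only doubles.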
I compared the sequential run time to the CUDA run time (speedup = sequential_time / cuda_time) while varying both the system size N and the simulation length end_time. The results are shown in the attached file.
I found several strange things about this:
- The speedup factor seems to increase with N indefinitely. I would expect that once N exceeds the effective “number of cores” of the GPU, the speedup levels off at a constant value, since additional grid cells cannot be computed in parallel with the others anyway but must wait for their turn on the GPU. This is the behaviour I get when parallelizing across the cores of the Core i7, at least: a fairly stable 3x speedup independent of N (unless N is very small).
- At large enough end_time, the speedup factor seems to decrease with increasing end_time when N is held constant. How can this be?
- The speedup factor seems very high in general. In this case it reaches 300, and the plot suggests it would keep increasing if I tried higher N.
Can anything explain why the speedup factor keeps increasing with N? Or why it starts decreasing with increasing end_time (once end_time is high enough)?
Can the speedup factor really be this high? Others have reported much more modest gains.
Or have I messed up somewhere?
EDIT: I have compared the plotted outputs of the program run on the GPU and the CPU, and the results always look identical.
compare_both_adv.pdf (13.5 KB)