Sequential code is faster than parallel, how is it possible?

Hello everyone!
I wrote the classic vector-addition example as a CUDA kernel, indexing the elements by thread and block ID.
The arrays have a million elements each, but the parallel CUDA version takes 0.33 seconds, while the sequential CPU version takes only 0.01 seconds … I made sure not to use too many blocks and to distribute the threads sensibly, so what is the cause? Synchronization overhead?

  1. CUDA has a start-up overhead. For “small” problems like this one, the startup overhead will outweigh any gains from using the GPU.

  2. In spite of what you may think, vector add is a bandwidth-bound problem, not a compute-bound one. Bandwidth-bound problems that have no data reuse (vector add has none) generally won’t, by themselves, be good or interesting candidates for GPU acceleration. The bandwidth-bound GPU code will run faster than the CPU code (if you time the operations carefully, removing start-up overhead, etc.), but moving the data to and from the GPU requires at least one read of the inputs from CPU memory and one write of the outputs back to CPU memory. The CPU code, which is also bandwidth-bound, incurs that same one read and one write. So even if the GPU computed the result in zero time, it would not be a speed-up over the CPU version, since the read of inputs and the write of outputs (in CPU memory) is required in either case.
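To see points 1 and 2 on your own machine, something like the following rough sketch may help. It is not the official vectorAdd sample; the names (`blocksFor`, `elapsedMs`), the block size of 256, and the idea of running an untimed warm-up pass first are my own assumptions. It times the copies and the kernel separately with CUDA events, and error checking is left out to keep it short.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// c[i] = a[i] + b[i], one thread per element.
__global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Number of blocks needed to cover n elements (rounds up). Hypothetical helper.
int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Milliseconds between two recorded events. Hypothetical helper.
float elapsedMs(cudaEvent_t start, cudaEvent_t stop) {
    float ms = 0.0f;
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    return ms;
}

int main() {
    const int n = 1 << 20;                  // about a million elements
    const size_t bytes = n * sizeof(float);
    const int threads = 256;                // assumed block size

    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    // The first runtime call also pays for context creation (start-up cost).
    float *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);

    cudaEvent_t t0, t1, t2, t3;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventCreate(&t2); cudaEventCreate(&t3);

    // Pass 0 is an untimed warm-up; the events keep the timings of pass 1.
    for (int pass = 0; pass < 2; ++pass) {
        cudaEventRecord(t0);
        cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(t1);
        vectorAdd<<<blocksFor(n, threads), threads>>>(dA, dB, dC, n);
        cudaEventRecord(t2);
        cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
        cudaEventRecord(t3);
        cudaEventSynchronize(t3);
    }

    printf("host->device copies: %.3f ms\n", elapsedMs(t0, t1));
    printf("kernel:              %.3f ms\n", elapsedMs(t1, t2));
    printf("device->host copy:   %.3f ms\n", elapsedMs(t2, t3));

    cudaEventDestroy(t0); cudaEventDestroy(t1);
    cudaEventDestroy(t2); cudaEventDestroy(t3);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```

If the explanation above holds for your setup, the kernel line should come out much smaller than the two copy lines, and all three together much smaller than a measurement that includes the very first CUDA call.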

So in practice, nobody does a single vector add on a GPU and calls it a day. The objective is to move much of your workload to the GPU, and take advantage of running significant sections of your code (perhaps most or all of it) in the accelerated setting.

The CUDA vectorAdd sample code:

is not offered to developers as a great example of GPU acceleration (try a large matrix-matrix multiply using CUBLAS for that), but instead is offered as a very simple “template” application showing the basic steps necessary to perform computation on the GPU, in a way that developers can readily understand.

This type of question gets asked regularly. You can find other examples of it and explanations if you do some searching.


It has been a while since there was a serious rebuttal to the “GPUs suck because it takes so long to get the results to and from the GPU” argument.

Sure, there are vast numbers of excellent multi-core CPU-powered solutions out there but it’s probably time to bury the myth that building your system around GPUs is potentially a mistake. [1] [2] […]

What has changed in the past 5 years is that GPU developers have gone from writing relatively isolated kernels to developing sophisticated systems that are mostly GPU-resident and tolerant of low bandwidth PCIe connections to the host.

System hardware changes but engineers also mature in their understanding of its potential.

NVIDIA should create an updated survey of modern GPU-powered products along with some simple marketing-friendly metrics that reveal the power and economy of sophisticated GPU-driven solutions.

On second thought, I’m sure NVIDIA already makes these arguments on a daily basis so perhaps adding a system level “architecture hints” chapter to the CUDA docs would be more beneficial to GPU developers.

One example of a survey that is updated regularly:

Once upon a time that survey used to include “marketing-friendly metrics” that offered an expected or typical speed-up in many cases. Those have been removed; I think they were difficult to maintain accurately.

Another good example of an unstructured “updated survey” can be found at the parallelforall blog:

I regularly find gems there that I had missed or have not read yet.

Post your example source code please and details of the hardware used (CPU/GPU).

I would claim that the argument “GPUs suck because it takes so long to get the results to and from the GPU” has been dead since at least the introduction of PCIe gen3, if not earlier (unless one interacts with people who really have no clue about GPU-accelerated applications, maybe). As I recall, when PCIe gen 3 was first introduced, full-duplex operation would routinely max out the system memory performance of commonly used hardware.

In addition to the GPU-application showcase, NVIDIA regularly points out GPU acceleration successes via their blog, and I am always surprised how many different application areas can make good use of GPUs by tossing out conventional-wisdom approaches. As you say, at this point many successful applications have been re-engineered to be pretty much resident on the GPU, with negligible impact from PCIe. The apps that have not been re-engineered perform with (very) low efficiency on both GPUs and modern CPUs (surprise, surprise). Time to ditch the dusty-deck codes dating to the 1970s and 1980s!

A couple of other issues which may trip up people new to CUDA:

1.) If using Visual Studio, make sure that you are compiling in release rather than debug. For some reason Visual Studio’s default setting for CUDA is debug (-G), and this flag makes a massive difference in performance. Also make sure you are compiling for the highest architecture supported by your card.

2.) When copying data from CPU to GPU (and back), allocate the host memory as ‘pinned’ (‘page-locked’) rather than ‘pageable’. You do this on the host via cudaHostAlloc(), and it can almost double the throughput of memory transfers.

3.) A vector add of a small array (1 million elements) is really not a good example of an application with a large performance difference compared to a CPU. Applications such as image processing/reconstruction, brute-force search, dense linear algebra subroutines and sorting are better examples.
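For point 2, a small sketch along these lines compares a pageable and a pinned host buffer. The 64 MiB buffer size and the helper names (`gbPerSec`, `timeCopyMs`) are assumptions of mine, error checking is omitted, and the actual ratio will depend on your system. On the command line, point 1 amounts to building with optimization and without -G, for example `nvcc -O3 -arch=sm_XX file.cu` with `sm_XX` replaced by your card's architecture.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Convert a byte count and a time in milliseconds to GB/s. Hypothetical helper.
double gbPerSec(size_t bytes, float ms) {
    return (bytes / 1e9) / (ms / 1e3);
}

// Time one host-to-device copy with CUDA events; returns milliseconds.
float timeCopyMs(void *dst, const void *src, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const size_t bytes = (size_t)64 << 20;   // 64 MiB, an arbitrary test size

    void *dev = NULL;
    cudaMalloc(&dev, bytes);

    // Ordinary pageable host memory.
    void *pageable = malloc(bytes);
    memset(pageable, 0, bytes);

    // Pinned (page-locked) host memory.
    void *pinned = NULL;
    cudaHostAlloc(&pinned, bytes, cudaHostAllocDefault);
    memset(pinned, 0, bytes);

    // Untimed warm-up copy so start-up cost is not attributed to either buffer.
    cudaMemcpy(dev, pageable, bytes, cudaMemcpyHostToDevice);

    float msPageable = timeCopyMs(dev, pageable, bytes);
    float msPinned   = timeCopyMs(dev, pinned, bytes);

    printf("pageable: %.3f ms (%.2f GB/s)\n", msPageable, gbPerSec(bytes, msPageable));
    printf("pinned:   %.3f ms (%.2f GB/s)\n", msPinned, gbPerSec(bytes, msPinned));

    cudaFreeHost(pinned);   // pinned memory must be released with cudaFreeHost
    free(pageable);
    cudaFree(dev);
    return 0;
}
```

Pinned memory is a limited resource, so it is usually reserved for the buffers that actually cross the bus.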


I think txbob hit the nail on the head in his reply: vector addition is definitely bandwidth-bound and not compute-bound. Don’t forget, CPUs are fast. Very fast. Especially when you can write a dumb loop that achieves something like a 99.99% cache hit rate. You can make it even faster if you use SSE instructions as well.

You’ll notice a GPU speed-up when the balance between memory reads and computation begins to tip heavily toward the compute side.

And yes, definitely allocate pinned host memory.

Actually, forget that. You should be using Thrust over raw CUDA stuff wherever possible. Using raw CUDA instead is like choosing C-style stuff over C++ constructs, just plain yucky.
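For what it's worth, the same vector add in Thrust could look roughly like this. The function name `addOnDevice` is made up for illustration; the Thrust calls themselves (`device_vector`, `host_vector`, `transform`, `plus`) are the standard ones. Treat it as a sketch rather than a tuned implementation.

```cuda
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>

// Adds two host vectors on the GPU and returns the result on the host.
// Hypothetical helper; assumes a and b have the same length.
thrust::host_vector<float> addOnDevice(const thrust::host_vector<float> &a,
                                       const thrust::host_vector<float> &b) {
    thrust::device_vector<float> dA = a;          // host -> device copy
    thrust::device_vector<float> dB = b;          // host -> device copy
    thrust::device_vector<float> dC(a.size());

    // Element-wise dC[i] = dA[i] + dB[i]; Thrust picks the launch configuration.
    thrust::transform(dA.begin(), dA.end(), dB.begin(), dC.begin(),
                      thrust::plus<float>());

    thrust::host_vector<float> c = dC;            // device -> host copy
    return c;
}

int main() {
    const int n = 1 << 20;
    thrust::host_vector<float> a(n, 1.0f);
    thrust::host_vector<float> b(n, 2.0f);

    thrust::host_vector<float> c = addOnDevice(a, b);
    printf("c[0] = %.1f, c[n-1] = %.1f\n", c[0], c[n - 1]);
    return 0;
}
```

Note that this does not change the bandwidth argument made earlier in the thread: the copies to and from the device still happen, Thrust just writes them for you.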

Heartfelt thanks to all, I’m just getting into this world and you have been helpful to me!