Simple proven (timed) example code where GPU beats CPU, anyone?

Reading through the CUDA programming guide and playing with the examples, I decided to time the execution of kernels vs. CPU loops… Nothing fancy, just a simple kernel that performs a map-type operation.

To my surprise, the GPU is orders of magnitude slower to complete the task than the CPU. I thought maybe I was doing something wrong, so I went to the CUDA samples and timed them - same result.

GPU = GeForce GTX 780M (1536 CUDA cores), CPU = i7-4900MQ, CUDA toolkit version = 5.5

I am timing using cudaEvent* functions.

Played with grid size, block size and overall data size - no improvements, even on large datasets. Played with Debug/Release configurations - no difference.

I fully understand the penalties and bottlenecks of host<->device data transfers and especially the time penalties of device memory allocations, so for the sake of experimental purity I decided to also time just the kernel execution alone. Result: even an empty CUDA kernel takes longer to execute than a simple non-empty CPU loop.
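For context on that last observation: a kernel launch by itself carries a fixed overhead on the order of microseconds, so an empty kernel can never beat a trivial CPU loop over a small array. A rough sketch of how one might measure the launch overhead alone (illustrative only; assumes a CUDA build environment):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void EmptyKernel() {}

int main()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm up: the first launch pays one-time initialization costs.
    EmptyKernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    const int launches = 1000;
    cudaEventRecord(start);
    for (int i = 0; i < launches; ++i)
        EmptyKernel<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Average launch overhead: %f us\n", 1000.0f * ms / launches);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

The GPU only wins once the per-launch overhead is amortized over enough parallel work.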

I am eager to figure out the problem myself, so I am not asking to fix my code. But can anyone please post an example of a program that performs the same computation on GPU and CPU, times both and comes up with faster execution speed on GPU? This way I will know it is feasible in principle and will work on my code eventually achieving the same. It doesn’t matter what the computation is, as long as it proves the concept that GPU can compute faster than CPU in certain circumstances. I just need to figure out what these circumstances are.

A simple memcopy kernel should work, since it's DRAM-bandwidth bound and GPU bandwidth > CPU bandwidth:

__global__ void MemCopyKernel(const float4* In, float4* Out)
{
    int ID = threadIdx.x + blockDim.x * blockIdx.x;
    Out[ID] = In[ID];
}

Copy several hundred megabytes with this and your GPU will certainly be faster.
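A sketch of a host-side driver for a kernel like this, timing only the device-side copy and reporting effective bandwidth (the kernel is repeated so the example is self-contained; sizes and launch parameters are illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void MemCopyKernel(const float4* In, float4* Out)
{
    int ID = threadIdx.x + blockDim.x * blockIdx.x;
    Out[ID] = In[ID];
}

int main()
{
    // 256 MB in, 256 MB out (float4 = 16 bytes).
    const size_t N = (256u << 20) / sizeof(float4);
    float4 *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, N * sizeof(float4));
    cudaMalloc(&dOut, N * sizeof(float4));

    const int block = 256;
    const int grid = (int)((N + block - 1) / block);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    MemCopyKernel<<<grid, block>>>(dIn, dOut);  // warm-up launch
    cudaEventRecord(start);
    MemCopyKernel<<<grid, block>>>(dIn, dOut);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // One read plus one write: 2 x 256 MB = 0.5 GB moved through DRAM.
    printf("Effective bandwidth: %.1f GB/s\n", 0.5 / (ms / 1000.0));

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```

Note that this deliberately excludes host<->device transfers, since the point is to compare raw DRAM bandwidth.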

I have over a dozen timed CPU vs GPU examples on my GitHub page;

https://github.com/OlegKonings?tab=repositories

If you are using Visual Studio, make sure the -G flag (device debug) is NOT set. It is on by default in VS Debug configurations and needs to be changed. It makes a huge difference in running time.

One of the best ways to compare is to use thrust::sort() and compare it to std::sort().

Try generating 500 million random 32-bit floats and sorting on the CPU, then on the GPU. Use raw device pointers to GPU memory rather than a thrust::device_vector, as it is faster.
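To make that suggestion concrete, here is a rough sketch of such a comparison (sizes are illustrative; 500 million floats is ~2 GB, and the sort needs temporary storage on top of that, so you may need to shrink N to fit your card's memory):

```cuda
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>
#include <thrust/device_ptr.h>
#include <thrust/sort.h>

int main()
{
    const size_t N = 500ull * 1000 * 1000;  // shrink if it does not fit on your card
    std::vector<float> h(N);
    for (size_t i = 0; i < N; ++i)
        h[i] = (float)rand() / RAND_MAX;

    // CPU: std::sort on a copy, timed with a wall clock.
    std::vector<float> cpu(h);
    auto c0 = std::chrono::steady_clock::now();
    std::sort(cpu.begin(), cpu.end());
    auto c1 = std::chrono::steady_clock::now();

    // GPU: thrust::sort through raw device pointers (no thrust::device_vector).
    float* d = nullptr;
    cudaMalloc(&d, N * sizeof(float));
    cudaMemcpy(d, h.data(), N * sizeof(float), cudaMemcpyHostToDevice);
    thrust::device_ptr<float> dp(d);
    auto g0 = std::chrono::steady_clock::now();
    thrust::sort(dp, dp + N);
    cudaDeviceSynchronize();  // ensure the sort has finished before stopping the clock
    auto g1 = std::chrono::steady_clock::now();
    cudaFree(d);

    printf("CPU sort: %.2f s, GPU sort: %.2f s\n",
           std::chrono::duration<double>(c1 - c0).count(),
           std::chrono::duration<double>(g1 - g0).count());
    return 0;
}
```

The host-to-device copy is deliberately left outside the timed region, since the point is to compare the sorts themselves.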

Also keep in mind that the 780 will not be great for 64-bit (double-precision) numbers, as it is primarily a PC gaming card.

Yes, I am not planning to use double precision; I was actually playing with 32-bit integers and a simple 1-to-1 map kernel when I got this confusing result.

  1. Thanks, just tried your CUDA_Matrix_Pow project and got what you would normally expect - 8.3 s on CPU, 0.2 s on GPU. This makes sense. Many thanks, I will be now digging into your example and trying to make my code do similar stuff.

  2. On -G: I checked, and it is only present in Debug builds, yet both Debug and Release builds in my case are horrendously slow on the GPU side. So -G alone cannot explain the strange effect I am seeing.

Either way, your CUDA_Matrix_Pow serves as proof of concept, which is exactly what I needed. It doesn't solve my problem yet, but it gives me a good example to strive for and the assurance that it is physically possible and that nothing is wrong with my particular hardware. Many thanks!

I will let others know what was the culprit once I figure it out!

OK, as promised, I am reporting on the culprit. I found what was wrong with my code. Actually, nothing was wrong with the code itself; what was wrong was how I was timing the CPU code. I used CUDA's cudaEventRecord() and cudaEventElapsedTime() functions. They work great for timing GPU activity, but apparently underreport the elapsed time when there is no GPU activity between the start and stop events (even when the GPU is selected). I thought this was me not doing enough RTFM, so I went back to TFM and checked - no, the manual does not mention this.

Anyway, it is solved now - I use the good old multimedia timer for timing both CPU and GPU code, just like the user CudaaduC does in his code, and the measurement results now make sense: the GPU is faster than the CPU, as it should be. Many thanks to all for the help, it is much appreciated!