Simple proven (timed) example code where GPU beats CPU, anyone?

Reading through the CUDA programming guide and playing with the examples, I decided to time the execution of kernels vs. CPU loops… Nothing fancy, just a simple kernel that performs a map-type operation.

To my surprise, the GPU is orders of magnitude slower to complete the task than the CPU. I thought maybe I was doing something wrong, so I went to the CUDA samples and timed them - same result.

GPU = GeForce GTX 780M (1536 CUDA cores), CPU = i7-4900MQ, CUDA toolkit version = 5.5

I am timing using cudaEvent* functions.

Played with grid size, block size and overall data size - no improvements, even on large datasets. Played with Debug/Release configurations - no difference.

I fully understand the penalties and bottlenecks of host<->device data transfers and especially the time penalties of device memory allocations, so for the sake of experimental purity I decided to also time just the kernel execution alone. Result: even an empty CUDA kernel takes longer to execute than a simple non-empty CPU loop.
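For context on that last observation: a kernel launch by itself carries a fixed overhead on the order of microseconds, so an empty kernel can never beat a trivial CPU loop over a small array. A rough sketch of how one might measure the launch overhead alone (illustrative only; assumes a CUDA build environment):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void EmptyKernel() {}

int main()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm up: the first launch pays one-time initialization costs.
    EmptyKernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    const int launches = 1000;
    cudaEventRecord(start);
    for (int i = 0; i < launches; ++i)
        EmptyKernel<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Average launch overhead: %f us\n", 1000.0f * ms / launches);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

The GPU only wins once the per-launch overhead is amortized over enough parallel work.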

I am eager to figure out the problem myself, so I am not asking to fix my code. But can anyone please post an example of a program that performs the same computation on GPU and CPU, times both and comes up with faster execution speed on GPU? This way I will know it is feasible in principle and will work on my code eventually achieving the same. It doesn’t matter what the computation is, as long as it proves the concept that GPU can compute faster than CPU in certain circumstances. I just need to figure out what these circumstances are.

A simple memcopy kernel should work, since it's DRAM-bandwidth bound and GPU bandwidth > CPU bandwidth:

__global__ void MemCopyKernel(const float4* In, float4* Out)
{
    int ID = threadIdx.x + blockDim.x * blockIdx.x;
    Out[ID] = In[ID];
}

Copy several hundred megabytes with this and your GPU will certainly be faster.
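A sketch of a host-side driver for a kernel like this, timing only the device-side copy and reporting effective bandwidth (the kernel is repeated so the example is self-contained; sizes and launch parameters are illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void MemCopyKernel(const float4* In, float4* Out)
{
    int ID = threadIdx.x + blockDim.x * blockIdx.x;
    Out[ID] = In[ID];
}

int main()
{
    // 256 MB in, 256 MB out (float4 = 16 bytes).
    const size_t N = (256u << 20) / sizeof(float4);
    float4 *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, N * sizeof(float4));
    cudaMalloc(&dOut, N * sizeof(float4));

    const int block = 256;
    const int grid = (int)((N + block - 1) / block);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    MemCopyKernel<<<grid, block>>>(dIn, dOut);  // warm-up launch
    cudaEventRecord(start);
    MemCopyKernel<<<grid, block>>>(dIn, dOut);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // One read plus one write: 2 x 256 MB = 0.5 GB moved through DRAM.
    printf("Effective bandwidth: %.1f GB/s\n", 0.5 / (ms / 1000.0));

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```

Note that this deliberately excludes host<->device transfers, since the point is to compare raw DRAM bandwidth.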

I have over a dozen timed CPU vs GPU examples on my GitHub page;

https://github.com/OlegKonings?tab=repositories

If you are using Visual Studio, make sure the -G flag (device debug) is NOT set. It is on by default in VS Debug configurations and needs to be changed. It makes a huge difference in running time.

One of the best ways to compare is to use thrust::sort() and compare it to std::sort().

Try generating 500 million random 32-bit floats and sorting on the CPU, then on the GPU. Use raw device pointers to GPU memory rather than a thrust::device_vector, as it is faster.
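To make that suggestion concrete, here is a rough sketch of such a comparison (sizes are illustrative; 500 million floats is ~2 GB, and the sort needs temporary storage on top of that, so you may need to shrink N to fit your card's memory):

```cuda
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>
#include <thrust/device_ptr.h>
#include <thrust/sort.h>

int main()
{
    const size_t N = 500ull * 1000 * 1000;  // shrink if it does not fit on your card
    std::vector<float> h(N);
    for (size_t i = 0; i < N; ++i)
        h[i] = (float)rand() / RAND_MAX;

    // CPU: std::sort on a copy, timed with a wall clock.
    std::vector<float> cpu(h);
    auto c0 = std::chrono::steady_clock::now();
    std::sort(cpu.begin(), cpu.end());
    auto c1 = std::chrono::steady_clock::now();

    // GPU: thrust::sort through raw device pointers (no thrust::device_vector).
    float* d = nullptr;
    cudaMalloc(&d, N * sizeof(float));
    cudaMemcpy(d, h.data(), N * sizeof(float), cudaMemcpyHostToDevice);
    thrust::device_ptr<float> dp(d);
    auto g0 = std::chrono::steady_clock::now();
    thrust::sort(dp, dp + N);
    cudaDeviceSynchronize();  // ensure the sort has finished before stopping the clock
    auto g1 = std::chrono::steady_clock::now();
    cudaFree(d);

    printf("CPU sort: %.2f s, GPU sort: %.2f s\n",
           std::chrono::duration<double>(c1 - c0).count(),
           std::chrono::duration<double>(g1 - g0).count());
    return 0;
}
```

The host-to-device copy is deliberately left outside the timed region, since the point is to compare the sorts themselves.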

Also keep in mind that the 780 will not be great for 64-bit (double-precision) numbers, as it is primarily a PC gaming card.

Yes, I am not planning to use double precision; I was actually playing with 32-bit integers and a simple 1-to-1 map kernel when I got this confusing result.

  1. Thanks, just tried your CUDA_Matrix_Pow project and got what you would normally expect - 8.3 s on CPU, 0.2 s on GPU. This makes sense. Many thanks, I will be now digging into your example and trying to make my code do similar stuff.

  2. On -G: I checked, and it is only present in Debug builds, yet both Debug and Release builds in my case are horrendously slow on the GPU side. So -G alone cannot explain the strange effect I am seeing.

Either way, your CUDA_Matrix_Pow serves as proof of concept, which is exactly what I needed. It doesn't solve my problem yet, but it gives me a good example to strive for and the assurance that it is physically possible and that nothing is wrong with my particular hardware. Many thanks!

I will let others know what was the culprit once I figure it out!

OK, as promised, I am reporting on the culprit. I found what was wrong with my code. Actually, nothing was wrong with the code itself; what was wrong was how I was timing the CPU code. I used CUDA's cudaEventRecord() and cudaEventElapsedTime() functions. They work great for timing GPU activity, but apparently underreport the elapsed time when there is no GPU activity between the start and stop events (even when the GPU is selected). I thought this was me not doing enough RTFM, so I went back to TFM and checked - no, the manual does not mention this.

Anyway, it is solved now - I use the good old multimedia timer for timing both CPU and GPU code, just like the user CudaaduC does in his code, and the measurement results now make sense: the GPU is faster than the CPU, as it should be. Many thanks to all for the help, it is much appreciated!