Reading through the CUDA programming guide and playing with the examples, I decided to time the execution of kernels vs. CPU loops. Nothing fancy, just a simple kernel that performs a map-type operation.
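Something along these lines (simplified; the multiply-by-two is just a stand-in for whatever per-element operation is being done):

```
// simple map-type kernel: one thread per element
__global__ void mapKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;   // trivial per-element work
}
```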
To my surprise, the GPU is orders of magnitude slower to complete the task than the CPU. I thought maybe I was doing something wrong, so I went to the CUDA samples and timed them - same result.
GPU = GeForce GTX 780M (1536 CUDA cores), CPU = i7-4900MQ, CUDA toolkit version = 5.5
I am timing using cudaEvent* functions.
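The timing pattern is the standard cudaEvent one; error checks and the cudaMalloc/cudaMemcpy setup for d_in/d_out are omitted here, and the launch configuration values are just placeholders:

```
int n = 1 << 20;
int threads = 256;
int blocks = (n + threads - 1) / threads;

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
mapKernel<<<blocks, threads>>>(d_in, d_out, n);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // kernel time in milliseconds
```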
I played with grid size, block size, and overall data size - no improvement, even on large datasets. I also tried both Debug and Release configurations - no difference.
I fully understand the penalties and bottlenecks of host<->device data transfers, and especially the cost of device memory allocations, so for the sake of a clean experiment I also timed just the kernel execution on its own. Result: even an empty CUDA kernel takes longer to execute than a simple non-empty CPU loop.
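Concretely, the kernel-only comparison boils down to something like this (the GPU launch is timed with the cudaEvent pattern above; h_in/h_out are just made-up names for host arrays, and the CPU loop is timed with an ordinary host timer):

```
// empty kernel: no work at all, measuring launch + execution overhead only
__global__ void emptyKernel() {}

// GPU side, timed with cudaEvents:
emptyKernel<<<blocks, threads>>>();

// CPU side, the same per-element work as the map kernel:
for (int i = 0; i < n; ++i)
    h_out[i] = h_in[i] * 2.0f;
```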
I am eager to figure out the problem myself, so I am not asking anyone to fix my code. But could someone please post an example of a program that performs the same computation on the GPU and the CPU, times both, and ends up with a faster execution time on the GPU? That way I will know it is feasible in principle and can keep working on my code until it achieves the same. It doesn't matter what the computation is, as long as it proves that the GPU can compute faster than the CPU under certain circumstances. I just need to figure out what those circumstances are.