In the SDK project, the GPU function takes more time than the CPU function

In the SDK there is a sample project called cppIntegration. I have measured the time taken by the kernel functions and the equivalent normal C++ functions using this method:

unsigned int timer = 0;
cutCreateTimer(&timer);
cutStartTimer(timer);
kernel<<< grid, threads >>>((int*) d_data);
kernel2<<< grid, threads2 >>>(d_data_int2);
cudaThreadSynchronize();  // kernel launches are asynchronous, so wait before stopping the timer
cutStopTimer(timer);
printf("Processing time: GPU %f (ms)\n", cutGetTimerValue(timer));

These two kernel functions take approximately 0.151276 ms to execute, and

unsigned int timer1 = 0;
cutCreateTimer(&timer1);
cutStartTimer(timer1);
computeGold(reference, data, len);
computeGold2(reference2, data_int2, len);
cutStopTimer(timer1);
printf("Processing time: CPU %f (ms)\n", cutGetTimerValue(timer1));

These two C++ functions take approximately 0.002724 ms to execute.

My graphics card is:

Card name: NVIDIA GeForce 8400 GS
Manufacturer: NVIDIA
Chip type: GeForce 8400 GS
DAC type: Integrated RAMDAC
Device Key: Enum\PCI\VEN_10DE&DEV_0404&SUBSYS_00000000&REV_A1
Display Memory: 1906 MB
Dedicated Memory: 499 MB
Shared Memory: 1406 MB

Can anyone please help me with this and suggest why the time taken by the GPU is greater than that of the CPU, when it should be the other way around?

Thanks in advance!!!

Not every example in the SDK must run faster on the GPU. ;-) In this case, the amount of processed data is very low (an array of 16 elements), so even on a GeForce GTX 285 the computation is slower on the GPU than on the CPU. I am not sure about the exact reason; maybe it's because the kernel call itself is more time-consuming than the (small) kernel computation. Maybe someone else can give a better and deeper explanation.

Thanks for the reply, Paul, but I just want to know in which cases I can use the GPU instead of the CPU… I have a lot of confusion about this.

Data-parallel applications are GPU friendly.

Operations on huge vectors, arrays, etc. are most likely data-parallel, so they can be moved to the GPU…

Ditto Sarnath. Use the GPU whenever you process very large vectors or matrices, especially when you need to perform algorithms with complexity worse than O(n), e.g. matrix multiplication. cppIntegration has complexity O(n) and operates on a 16-element vector; there the GPU is useless, because even moving the data from CPU to GPU and initializing the kernel costs much more time than the whole computation on the CPU.

Thanks a lot Paul and Sarnath for your help!!!

I have one more question regarding this. I have a small data set, say an array of 100, but inside the FOR loop there are around 50 more calculations and lots of arithmetic operations like +, -, /, *, finding square roots, etc. So is it OK to port these kinds of code to the CUDA GPU?

Thanks in advance

If you choose to expose the parallelism in terms of the 100 elements, then you have just 100 threads, i.e. too few to keep a GPU busy.

If you choose to expose the parallelism present in the FOR loop along with the 100 elements, then you have FOR_LOOP_ITERATIONS * 100 threads, which could be a good number for the GPU.

The main criterion for parallelizing FOR loops is that there must NOT be dependences between iterations. If an iteration depends on the results of previous iterations, the iterations cannot be expressed as parallel threads…

You may want to think about this and choose the parallel perspective that will feed the GPU elephant correctly!

I am running into a similar issue here. I have several arrays (6), but they are all 3x3 arrays, so computation on the GPU seems to be VERY slow compared to the CPU. The results match the CPU's, but it is much, much slower.

Am I better off not using CUDA here? Or are there other techniques to speed up CUDA to the point that it can outperform the CPU?

Please Help!