Program without CUDA is faster

I wrote a C program that calls a function to do some work on the GPU.
It works, so now I'm trying to test whether it is really faster with the CPU and GPU working together.
To compare, I wrote a second program that does the same thing on the CPU only (it contains no CUDA code).
I put a timer in both programs: I call clock() at the start and at the end, and subtract the initial time from the final time.
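
For reference, here is a minimal sketch of that kind of clock() timing. The names time_cpu_seconds and count_call are made up for illustration and are not from the original program. Note that clock() reports processor time at a fairly coarse resolution, so a single pass over a few hundred elements can easily fall below what it can resolve; the sketch assumes that repeating the work and averaging is acceptable. For the CUDA version, keep in mind that kernel launches are asynchronous, so the host would need to synchronize with the device before reading the end time.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: calls work(arg) `repeats` times and returns the
   average processor time per call in seconds, as measured by clock(). */
double time_cpu_seconds(void (*work)(void *), void *arg, int repeats)
{
    assert(repeats > 0);
    clock_t start = clock();
    for (int i = 0; i < repeats; ++i)
        work(arg);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC / repeats;
}

/* Hypothetical stand-in workload: increments the counter that arg points to.
   In the real test this would be the array computation. */
void count_call(void *arg)
{
    ++*(long *)arg;
}
```

Calling time_cpu_seconds(count_call, &counter, 1000) would run the stand-in workload 1000 times and return the average time per call.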

The program has two arrays.
It multiplies each pair of consecutive elements in the first array and stores the product in both corresponding positions of the second array, like this:

from ---> a[0,1,2,3,4,5,6,7]
result ---> b[0,0,6,6,20,20,42,42]
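
A CPU-only version of this computation might look roughly like the sketch below. The function name multiply_pairs and the int element type are assumptions, since the original code was not posted; the sketch also assumes the array length is even.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed CPU reference: for each pair (a[i], a[i+1]) with i even,
   store the product in both b[i] and b[i+1]. n must be even. */
void multiply_pairs(const int *a, int *b, size_t n)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        int product = a[i] * a[i + 1];
        b[i] = product;
        b[i + 1] = product;
    }
}
```

With the eight-element input above, this fills b with 0, 0, 6, 6, 20, 20, 42, 42.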

I tried different array sizes, up to 700 elements.
I'm wondering why the program without CUDA is faster…

Sounds like you’re spending more time copying the arrays over and back than actually computing. It probably takes the CPU just as much time to send the array to the GPU as it would to just do the computation.
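
To get a feel for the copy cost, here is a back-of-the-envelope sketch that models one host-to-device copy as a fixed latency plus bytes divided by bandwidth. The function name transfer_seconds is made up, and the model is a simplification; any latency and bandwidth figures you plug in are placeholders rather than measurements of a particular card or bus.

```c
#include <assert.h>
#include <stddef.h>

/* Rough model (an assumption, not a measurement):
   copy time = fixed per-transfer latency + bytes / sustained bandwidth. */
double transfer_seconds(size_t bytes, double latency_s,
                        double bandwidth_bytes_per_s)
{
    assert(latency_s >= 0.0 && bandwidth_bytes_per_s > 0.0);
    return latency_s + (double)bytes / bandwidth_bytes_per_s;
}
```

For example, with assumed placeholder values of 10 microseconds latency and 4 GB/s bandwidth, transfer_seconds(2800, 10e-6, 4e9) estimates one copy of 700 four-byte values. In that estimate the fixed latency is far larger than the bytes/bandwidth term, which is why tiny arrays gain nothing from the GPU.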

CUDA is just a tool.

If you don't optimize the CUDA code, and/or your function is very small, I don't think CUDA will be faster than plain C code.

Could you show us your kernel code and your kernel call in host code?

Also, 700 elements is not nearly enough to saturate the GPU.

It also depends on which GPU you are using…

The biggest problem with CUDA is copying memory from host to device and back, so you need to run with bigger arrays. It is also important to optimize, as Quoc Vinh said. Where is your example code?

There is no way this is going to be faster on CUDA, regardless of what you do. This computation is memory bound: you have to do two memory loads, one memory store, and one multiply instruction for each element in the array. The time it takes to do the memory loads and store is going to be ~100-1000x the time it takes to do the multiply.
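
One way to make "memory bound" concrete is to count arithmetic operations per byte of memory traffic. The sketch below does that using the per-element counts given above (two loads, one store, one multiply). The function name ops_per_byte and the 4-byte element size are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic operations per byte moved for one element.
   A low value means memory traffic, not arithmetic, dominates. */
double ops_per_byte(int loads, int stores, int arith_ops,
                    size_t bytes_per_value)
{
    assert(loads + stores > 0 && bytes_per_value > 0);
    double bytes_moved = (double)(loads + stores) * (double)bytes_per_value;
    return (double)arith_ops / bytes_moved;
}
```

With ops_per_byte(2, 1, 1, 4), the kernel does one multiply for every 12 bytes moved, so the multiply itself is a small part of the work.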

Using CUDA, you not only have to load the data from memory (making the best case time equal to that of the CPU assuming the GPU is infinitely fast), you also have to send it over PCIe, which is ~10-100x slower than operations from main memory.

You need to be doing more computation per data element than just a multiply.