Slow speed of CUDA code

I am trying to solve a problem (the sum of two vectors) on an NVIDIA GPU with 32 cores, but
the time it takes is much longer than on a CPU with one core. The result is correct, but the speed is very low.

What is happening here? Each time, I copy my data from host to device with cudaMemcpy.
Could the data copy be the reason the code is slow?

A couple of problems could be going on here.

  1. What is your block size? For a vector add I’d have it 512.

  2. How big are the vectors? To make up for the PCI-E transfer time they would have to be pretty large. You should also time the kernel on its own, without the PCI-E transfers.
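To see how much of the total time is copying and how much is the kernel itself, one option is to bracket each phase with CUDA events. The following is only a minimal sketch, not the original poster's code: the names `vecAdd` and `timedVecAdd`, the test data, and the block size of 512 passed in `main` are assumptions made for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One thread per element; the bounds check covers the last, partly filled block.
__global__ void vecAdd(const int *a, const int *b, int *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical helper: adds two n-element vectors on the GPU and reports the
// host<->device copy time and the kernel time separately, in milliseconds.
// Returns true only if no CUDA error occurred and every result element is correct.
bool timedVecAdd(int n, int blockSize, float *copyMs, float *kernelMs)
{
    size_t bytes = (size_t)n * sizeof(int);
    std::vector<int> ha(n), hb(n), hc(n);
    for (int i = 0; i < n; ++i) { ha[i] = i; hb[i] = 2 * i; }

    // Allocate before starting the timers so that context creation is not measured.
    int *da = 0, *db = 0, *dc = 0;
    bool ok = cudaMalloc((void **)&da, bytes) == cudaSuccess &&
              cudaMalloc((void **)&db, bytes) == cudaSuccess &&
              cudaMalloc((void **)&dc, bytes) == cudaSuccess;

    cudaEvent_t e[4];
    for (int i = 0; i < 4; ++i) cudaEventCreate(&e[i]);

    if (ok) {
        cudaEventRecord(e[0]);
        cudaMemcpy(da, ha.data(), bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(db, hb.data(), bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(e[1]);

        int blocks = (n + blockSize - 1) / blockSize;  // enough blocks to cover n
        vecAdd<<<blocks, blockSize>>>(da, db, dc, n);
        cudaEventRecord(e[2]);

        cudaMemcpy(hc.data(), dc, bytes, cudaMemcpyDeviceToHost);
        cudaEventRecord(e[3]);
        cudaEventSynchronize(e[3]);  // wait until all timed work has finished

        float h2d = 0.0f, kern = 0.0f, d2h = 0.0f;
        cudaEventElapsedTime(&h2d, e[0], e[1]);
        cudaEventElapsedTime(&kern, e[1], e[2]);
        cudaEventElapsedTime(&d2h, e[2], e[3]);
        *copyMs = h2d + d2h;
        *kernelMs = kern;

        ok = (cudaGetLastError() == cudaSuccess);
        for (int i = 0; ok && i < n; ++i) ok = (hc[i] == 3 * i);
    }

    for (int i = 0; i < 4; ++i) cudaEventDestroy(e[i]);
    cudaFree(da); cudaFree(db); cudaFree(dc);
    return ok;
}

int main()
{
    float copyMs = 0.0f, kernelMs = 0.0f;
    bool ok = timedVecAdd(1000000, 512, &copyMs, &kernelMs);
    printf("correct: %s, copies: %.3f ms, kernel: %.3f ms\n",
           ok ? "yes" : "no", copyMs, kernelMs);
    return ok ? 0 : 1;
}
```

If the copy time dominates, tuning the kernel further will not change the total much.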

Vector addition is not a good example for GPU computing, as the arithmetic intensity is just too low. For every floating point operation you have to transfer 12 bytes over the PCIe bus (two 4-byte inputs to the device and one 4-byte result back), which will always be the bottleneck. As the PCIe bus can never be faster than main memory, the GPU can never be faster than the CPU for this problem as long as the data starts and ends in host memory.
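As a rough worked estimate of that bottleneck, take vectors of $N$ 4-byte elements and a PCIe bandwidth $B$. The value $B = 8$ GB/s below is an assumption (the theoretical peak of a PCIe 2.0 x16 link, which real transfers do not reach), and $N = 10^6$ is just an example size:

```latex
\begin{align*}
\text{bytes moved} &= \underbrace{2 \cdot 4N}_{\text{inputs to device}}
                    + \underbrace{4N}_{\text{result to host}} = 12N \\
\text{arithmetic intensity} &= \frac{N \text{ additions}}{12N \text{ bytes}}
                             = \frac{1}{12}\ \text{operation per byte} \\
t_{\text{transfer}} &\ge \frac{12N}{B}
                     = \frac{12 \cdot 10^{6}\ \text{bytes}}{8 \cdot 10^{9}\ \text{bytes/s}}
                     = 1.5\ \text{ms}
\end{align*}
```

So under these assumptions the transfers alone take at least 1.5 ms, regardless of how fast the kernel runs.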

I am new to CUDA. Only 2 days!

I am trying to sum 2 vectors of 1,000,000 integers.
Each time, I load 32 elements (for the 32 cores) and compute their sums.
I do not know the block size. How can I find out?
In the kernel call I use 1 block and 32 threads.

Also, can you send me some simple examples of CUDA code?

Loads of examples:

Your number of threads and blocks is far too small. You want to have at least as many blocks as there are multiprocessors, and at the very least 24 (for compute capability 1.x) or 18 (for compute capability 2.x) times as many threads as there are cores on your GPU.
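As an illustration of a larger launch configuration, here is a minimal sketch that sizes the grid from the number of multiprocessors and lets each thread walk over the vector in a grid-stride loop. The names `vecAddStride` and `addWithManyBlocks`, the 256 threads per block, and the factor of 8 blocks per multiprocessor are assumptions picked for the example, not tuned or recommended values.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Grid-stride loop: each thread handles elements i, i + stride, i + 2*stride, ...
// so the kernel is correct for any number of blocks and threads.
__global__ void vecAddStride(const int *a, const int *b, int *c, int n)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        c[i] = a[i] + b[i];
}

// Hypothetical helper: adds two n-element vectors using several blocks per
// multiprocessor. Returns true if no CUDA error occurred and the result is correct.
bool addWithManyBlocks(int n)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return false;

    int threads = 256;                            // assumed block size
    int blocks = 8 * prop.multiProcessorCount;    // assumed: 8 blocks per multiprocessor
    int needed = (n + threads - 1) / threads;     // blocks needed for one element per thread
    if (blocks > needed) blocks = needed;         // do not launch idle blocks
    if (blocks < 1) blocks = 1;

    size_t bytes = (size_t)n * sizeof(int);
    std::vector<int> ha(n), hb(n), hc(n);
    for (int i = 0; i < n; ++i) { ha[i] = i; hb[i] = 2 * i; }

    int *da = 0, *db = 0, *dc = 0;
    bool ok = cudaMalloc((void **)&da, bytes) == cudaSuccess &&
              cudaMalloc((void **)&db, bytes) == cudaSuccess &&
              cudaMalloc((void **)&dc, bytes) == cudaSuccess;

    if (ok) {
        cudaMemcpy(da, ha.data(), bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(db, hb.data(), bytes, cudaMemcpyHostToDevice);

        vecAddStride<<<blocks, threads>>>(da, db, dc, n);

        // This blocking copy also waits for the kernel to finish.
        ok = cudaMemcpy(hc.data(), dc, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
        for (int i = 0; ok && i < n; ++i) ok = (hc[i] == 3 * i);
    }

    cudaFree(da); cudaFree(db); cudaFree(dc);
    return ok;
}

int main()
{
    bool ok = addWithManyBlocks(1000000);
    printf("result correct: %s\n", ok ? "yes" : "no");
    return ok ? 0 : 1;
}
```

Compared with a launch of 1 block and 32 threads, this keeps every multiprocessor busy and gives the hardware enough threads to hide memory latency.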

Still, for vector addition you will not be able to reach the speed of the CPU once the transfers over the PCIe bus are counted.