I am trying to solve a problem (the sum of two vectors) on an NVIDIA GPU with 32 cores, but I see that it takes much longer than on a CPU with a single core. The result is correct, but the speed is very low.
What is happening here? Each time I copy my data from host to device with cudaMemcpy.
Could the data copy be the problem, and the reason the code is slow?
Vector addition is not a good example for GPU computing, as its arithmetic intensity is just too low. For every floating-point operation you have to transfer 12 bytes over the PCIe bus (two 4-byte loads and one 4-byte store), which will always be the bottleneck. As the PCIe bus can never be faster than main memory bandwidth, the GPU can never be faster than the CPU here.
I am trying to sum two vectors of 1,000,000 integers each.
Each time I load 32 elements (one per core) and do the sum.
I do not know what block size to use. How can I find this out?
In the kernel call I use 1 block and 32 threads.
Also, can you point me to some simple examples of CUDA code?
Your numbers of threads and blocks are far too small. You want at least as many blocks as there are multiprocessors, and at the very least 24 (for compute capability 1.x) or 18 (for compute capability 2.x) times as many threads as there are cores on your GPU, so that the hardware can hide memory latency.
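To see how many multiprocessors your GPU has and what its compute capability is, you can query the device at runtime. A minimal sketch using the standard CUDA runtime call `cudaGetDeviceProperties` (this assumes device 0 is the GPU you are using):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("Multiprocessors:       %d\n", prop.multiProcessorCount);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Compute capability:    %d.%d\n", prop.major, prop.minor);
    return 0;
}
```

From these numbers you can size your grid: pick a block size that is a multiple of the warp size (32), and use enough blocks to keep every multiprocessor busy.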
Still, for vector addition you will not be able to reach the speed of the CPU.
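Since you asked for a simple example: here is a minimal sketch of vector addition that launches enough threads to cover all 1,000,000 elements, instead of a single block of 32 threads. The block size of 256 is just a common default, not something dictated by your GPU; error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread computes exactly one output element.
__global__ void vecAdd(const int *a, const int *b, int *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    if (i < n)                                      // guard against overshoot
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1000000;
    const size_t bytes = n * sizeof(int);

    int *h_a = (int *)malloc(bytes);
    int *h_b = (int *)malloc(bytes);
    int *h_c = (int *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = i; h_b[i] = 2 * i; }

    int *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch enough blocks to cover all n elements, not just one block of 32.
    const int threadsPerBlock = 256;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[10] = %d\n", h_c[10]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```

Even with a proper launch configuration, the two `cudaMemcpy` transfers will dominate the runtime for this workload, which is exactly the bandwidth argument made above.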