Vector addition is not a good example for GPU computing, as its arithmetic intensity is just too low: for every floating point operation you have to transfer 12 bytes (two 4-byte operands in, one 4-byte result out) over the PCIe bus, which will always be the bottleneck. As the PCIe bus can never be faster than main memory bandwidth, the GPU can never be faster than the CPU on this task.
I try to sum two vectors of 1,000,000 integers.
Each time I load 32 elements (one per core) and do the sum.
I do not know what block size to use. How can I find out?
At the kernel call I use 1 block and 32 threads.
Also, can you send me some simple examples of CUDA code?
Your number of threads and blocks is far too small. You want to have at least as many blocks as there are multiprocessors, and at the very least 24 (for compute capability 1.x) or 18 (for compute capability 2.x) times as many threads as there are cores on your GPU.
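To see what your own device offers, you can query the multiprocessor count at runtime and size the launch from the problem size. A minimal sketch (256 threads per block is just a common starting point, not a derived optimum):

```cuda
#include <stdio.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    int n = 1000000;                     // vector length from your question
    int threadsPerBlock = 256;           // common starting point; tune for your GPU
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    printf("SMs: %d, launching %d blocks of %d threads\n",
           prop.multiProcessorCount, blocks, threadsPerBlock);
    return 0;
}
```

With 1,000,000 elements that gives you thousands of blocks, plenty to keep every multiprocessor busy.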
Still, for vector addition you will not be able to reach the speed of the CPU.
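Since you asked for a simple example: here is a minimal CUDA vector addition (error checking omitted for brevity; as explained above, it will still be PCIe-bound, so don't expect a speedup over the CPU):

```cuda
#include <stdio.h>
#include <stdlib.h>

__global__ void vecAdd(const int *a, const int *b, int *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // guard against the last, partially filled block
        c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1000000;
    size_t bytes = n * sizeof(int);

    int *a = (int *)malloc(bytes), *b = (int *)malloc(bytes), *c = (int *)malloc(bytes);
    for (int i = 0; i < n; ++i) { a[i] = i; b[i] = 2 * i; }

    int *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;   // enough blocks to cover all n elements
    vecAdd<<<blocks, threads>>>(dA, dB, dC, n);

    cudaMemcpy(c, dC, bytes, cudaMemcpyDeviceToHost);
    printf("c[%d] = %d\n", n - 1, c[n - 1]);    // should be 3 * (n - 1)

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(a); free(b); free(c);
    return 0;
}
```

Note the grid covers the whole vector in one launch; with your 1 block of 32 threads you would only ever compute the first 32 elements unless you loop inside the kernel.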