CUDA runs slower than the CPU

Hello, I tried to implement the program as follows:

void vectorAddWrappper(int blocksPerGrid, int threadsPerBlock, float *d_A, float *d_B, int numElements);

float add_num(float A, float B)
{
    float *d_A, *d_B;
    cudaMallocManaged((void **)&d_A, sizeof(A));
    cudaMallocManaged((void **)&d_B, sizeof(B));
    *d_A = A;                        // initialize managed memory from the inputs
    *d_B = B;
    vectorAddWrappper(1, 1, d_A, d_B, 1);
    cudaDeviceSynchronize();         // wait for the kernel before reading d_A
    float res = *d_A;
    cudaFree(d_A);
    cudaFree(d_B);
    return res;
}

__global__ void
vectorAdd(float *A, float *B, int numElements)
{
    *A = *A + *B;
}

void vectorAddWrappper(int blocksPerGrid, int threadsPerBlock, float *d_A, float *d_B, int numElements)
{
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, numElements);
}

But the CPU execution is faster than the GPU one.
I am on an NVIDIA Tesla K80.

Here is the result from nvprof:

Please suggest what I can do to make it run faster than the CPU.

You can’t expect memory transfer to the GPU + one addition per data element + memory transfer back to the CPU to be faster than directly accessing the memory on the CPU to do one addition.

The transfers go over the PCIe bus which is quite a bit slower than the bandwidth available to the CPU when directly accessing the data.

You have to compute much more in relation to what you transfer; then the GPU can win with a comfortable margin. That typically means accessing the same data elements many times, as is necessary in, e.g., a large matrix multiplication. The larger the matrix gets, the more advantage the GPU has over the CPU.