CUDA runs slower than the CPU

Hello, I tried to implement the program as follows:

test.cc

// Note: the parameter types must match the definition in kernel.cu (float*, not char*)
void vectorAddWrappper(int blocksPerGrid, int threadsPerBlock, float *d_A, float *d_B, int numElements);

float add_num(float A, float B)
{
    float *d_A, *d_B;
    // Managed memory is accessible from both host and device
    cudaMallocManaged((void **)&d_A, sizeof(A));
    cudaMallocManaged((void **)&d_B, sizeof(B));
    *d_A = A;
    *d_B = B;
    vectorAddWrappper(1, 1, d_A, d_B, 1);  // 1 block, 1 thread
    cudaDeviceSynchronize();               // wait for the kernel to finish
    float res = *d_A;
    cudaFree(d_A);
    cudaFree(d_B);
    //cudaDeviceReset();
    return res;
}

kernel.cu

__global__ void
vectorAdd(float *A,float *B, int numElements)
{
    *A= *A + *B;
}

void vectorAddWrappper(int blocksPerGrid, int threadsPerBlock, float *d_A, float *d_B, int numElements) {
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, numElements);
}

But the CPU execution is faster than the GPU one.
I am on an NVIDIA Tesla K80.

Here is the result from nvprof
https://ibb.co/KGtyxCq

Please suggest what I can do to make it run faster than the CPU.

You can’t expect memory transfer to the GPU + one addition per data element + memory transfer back to the CPU to be faster than directly accessing the memory on the CPU to do one addition.

The transfers go over the PCIe bus, whose bandwidth (roughly 16 GB/s for the PCIe 3.0 x16 link a K80 sits on) is quite a bit lower than the bandwidth available to the CPU when it accesses its own memory directly.

You have to compute much more in relation to what you transfer; then the GPU can win by a comfortable margin. That typically means accessing the same data elements many times, as is necessary in e.g. a large matrix multiplication. The larger the matrix gets, the more advantage the GPU has over the CPU.