Performance of addition of two vectors

Dear All,
I am a beginner in CUDA. I followed the example in "An Even Easier Introduction to CUDA" by Mark Harris, and I was surprised by the poor performance of my NVIDIA GeForce RTX 2060 Super card when adding two vectors of about 1 million floats (N = 1<<20).

Program add.cu
#include <iostream>
#include <math.h>

// Kernel: y[i] = x[i] + y[i], using a grid-stride loop
__global__ void add(int n, float *x, float *y)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride) y[i] = x[i] + y[i];
}

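int main(void)
{
    int N = 1 << 20;  // about 1 million elements
    float *x, *y;

    // Unified Memory: accessible from both CPU and GPU
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));

    // Initialize x and y on the host
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    int blockSize = 1024;
    int numBlocks = (N + blockSize - 1) / blockSize;

    add<<<numBlocks, blockSize>>>(N, x, y);
    // add<<<1,256>>>(N,x,y);
    cudaThreadSynchronize();  // deprecated; cudaDeviceSynchronize() is the current equivalent

    cudaFree(x);
    cudaFree(y);

    return 0;
}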

nvprof output file
==7076== NVPROF is profiling process 7076, command: add.exe
==7076== Profiling application: add.exe
==7076== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  32.576us         1  32.576us  32.576us  32.576us  add(int, float*, float*)
      API calls:   80.37%  301.30ms         2  150.65ms  3.3637ms  297.94ms  cudaMallocManaged
                   13.11%  49.138ms         1  49.138ms  49.138ms  49.138ms  cuDevicePrimaryCtxRelease
                    5.95%  22.299ms         1  22.299ms  22.299ms  22.299ms  cudaLaunchKernel
                    0.42%  1.5657ms         2  782.85us  777.30us  788.40us  cudaFree
                    0.08%  316.70us        97  3.2640us     100ns  159.20us  cuDeviceGetAttribute
                    0.04%  147.50us         1  147.50us  147.50us  147.50us  cudaThreadSynchronize
                    0.03%  118.90us         1  118.90us  118.90us  118.90us  cuModuleUnload
                    0.00%  17.800us         1  17.800us  17.800us  17.800us  cuDeviceTotalMem
                    0.00%  4.7000us         3  1.5660us     200ns  4.0000us  cuDeviceGetCount
                    0.00%  1.3000us         2     650ns     100ns  1.2000us  cuDeviceGet
                    0.00%     600ns         1     600ns     600ns     600ns  cuDeviceGetName
                    0.00%     200ns         1     200ns     200ns     200ns  cuDeviceGetUuid
                    0.00%     200ns         1     200ns     200ns     200ns  cuDeviceGetLuid

==7076== Unified Memory profiling result:
Device "GeForce RTX 2060 SUPER (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
     258  31.751KB  4.0000KB  32.000KB  8.000000MB  20.16850ms  Host To Device
     256  32.000KB  32.000KB  32.000KB  8.000000MB  66.22950ms  Device To Host

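I cannot understand why the times involving the host are so long: the kernel itself runs in 32.576 us, while the two cudaMallocManaged calls take 301.30 ms and the Unified Memory transfers take 20.17 ms (Host To Device) and 66.23 ms (Device To Host).
Any advice would be much appreciated.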
Pascal