Returning data from a global CUDA function?

I am starting out with CUDA and am trying a simple example where I pass two arrays to a `__global__` function, copy one into the other, and read the second one back on the host.

I have:

__global__
void add(int n, int *tri, int *y)
{
    int index = threadIdx.x;
    int stride = blockDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = tri[i];
}

and:

    // local copy of data
    int *tri2 = tri; // data checked, and is valid

    int *y = new int[width * height]; // same size as `tri`
    int N = width * height;

    // Allocate Unified Memory – accessible from CPU or GPU
    cudaMallocManaged(&tri2, N * sizeof(int));
    cudaMallocManaged(&y, N * sizeof(int));

    // initialize y array on the host
    for (int i = 0; i < N; i++) {
        y[i] = 2;
    }

    // Run kernel on the GPU
    add<<<1, 256>>>(N, tri2, y);

    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();

    //copy back to host
    int i = 0;
    int f = -999.0; /* CPU copy of value */
    cudaMemcpy(&f, &y[i], sizeof(int), cudaMemcpyDeviceToHost);

    std::cout << "back: " << f << std::endl;
    std::cout << "orig: " << tri[i] << std::endl;

The orig value is 128, the same as when it went in. The returned f value is always 0. What am I missing?

Perhaps you should add proper CUDA error checking and also run your code with cuda-memcheck.
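
To make the error-checking suggestion concrete, here is a minimal sketch of the same program with every runtime call checked. The `CUDA_CHECK` macro, the fixed `N = 1024`, and the host array filled with 128 are assumptions for illustration and are not from the original post. The sketch also addresses what looks like the likely cause of the 0: `cudaMallocManaged(&tri2, ...)` overwrites the pointer that was assigned from `tri`, so `tri2` ends up pointing at a fresh allocation that never receives the input data. Error checking alone would not flag that, because no CUDA call actually fails. On recent toolkits `compute-sanitizer` takes the place of `cuda-memcheck`.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): report the failing
// call's location and exit if a CUDA runtime call returns an error.
#define CUDA_CHECK(call)                                                 \
    do {                                                                 \
        cudaError_t err_ = (call);                                       \
        if (err_ != cudaSuccess) {                                       \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",            \
                         cudaGetErrorString(err_), __FILE__, __LINE__);  \
            std::exit(EXIT_FAILURE);                                     \
        }                                                                \
    } while (0)

// Same kernel as in the question: copies tri into y.
__global__
void add(int n, int *tri, int *y)
{
    int index = threadIdx.x;
    int stride = blockDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = tri[i];
}

int main()
{
    // Assumed stand-in for the poster's input: N host values, all 128.
    const int N = 1024;
    int *tri = new int[N];
    for (int i = 0; i < N; i++)
        tri[i] = 128;

    // Allocate unified memory. These calls set the pointers themselves,
    // so any value previously stored in tri2 or y would be discarded.
    int *tri2 = nullptr;
    int *y = nullptr;
    CUDA_CHECK(cudaMallocManaged(&tri2, N * sizeof(int)));
    CUDA_CHECK(cudaMallocManaged(&y, N * sizeof(int)));

    // Copy the input into the managed buffer AFTER allocating it.
    std::memcpy(tri2, tri, N * sizeof(int));
    for (int i = 0; i < N; i++)
        y[i] = 2;

    add<<<1, 256>>>(N, tri2, y);
    CUDA_CHECK(cudaGetLastError());       // catches launch errors
    CUDA_CHECK(cudaDeviceSynchronize());  // catches errors during execution

    // Managed memory can be read directly on the host after the sync,
    // so no cudaMemcpy is needed here.
    std::cout << "back: " << y[0] << std::endl;
    std::cout << "orig: " << tri[0] << std::endl;

    CUDA_CHECK(cudaFree(tri2));
    CUDA_CHECK(cudaFree(y));
    delete[] tri;
    return 0;
}
```

Note that the `new int[width * height]` assigned to `y` in the question is leaked for the same reason: the following `cudaMallocManaged(&y, ...)` replaces that pointer as well.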