Incorrect result obtained while computing atomic operation on variables stored in system memory and device memory

Dear all,

I have observed incorrect results while performing a small experiment.

Here are the details of the experiment:

I am trying to sum the elements of an array.
The input array is kept in device memory.
The sum (a single int) is kept in mapped host memory.

Here is the code:

#define N 512 
 
 __global__
 void test(int *sum, int *b) {
     int i = threadIdx.x;
     if (i<N) {
         atomicAdd(sum, b[i]);
     }
 }
 
 int main() {
     cudaDeviceReset();
     int *sum,  *d_sum, *b, *d_b;
     cudaHostAlloc((void **) &sum, sizeof(int), cudaHostAllocMapped);
 
     b = (int*) malloc(sizeof(int)*N);
     cudaMalloc((void **)&d_b, sizeof(int)*N);
     cudaMemcpy(d_b , b, sizeof(int)*N, cudaMemcpyHostToDevice);
 
 
     cudaHostGetDevicePointer((void **)&d_sum, sum , 0);
 
     for (int i=0 ; i<N ; i++) {
         b[i] = 2;
     }
 
     *sum = 0;
     cudaDeviceSynchronize();
     cudaProfilerStart();
     test<<<(N+255)/256,256>>>(d_sum, d_b);
 
     cudaDeviceSynchronize();
     cudaProfilerStop();
 
     printf("Sum is: %d \n", *sum);
 
     cudaFreeHost(sum);
     free(b);
     cudaFree(d_sum);
     cudaFree(d_b);
 }

Result : 0

Architecture and compiler details:

CUDA : 10.1
Kernel : 5.0.0-31-generic
OS : Ubuntu 18.04
Driver : 418.67
Hardware : GeForce RTX 2080

Can someone explain the reason for the discrepancy in the results?

Thanks,
Shweta

Some random thoughts:

“Discrepancy” and “results” imply that there is more than one result. I only see one result. Where is the other?

Why do you first copy ‘b’ to ‘d_b’, then initialize ‘b’? Shouldn’t the order be the other way around?
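
For reference, here is a minimal sketch of the posted program with the initialization moved ahead of the copy, so that d_b actually holds the values written to b. Two things in it are my own changes and not part of the original post: the kernel uses a global thread index (blockIdx.x * blockDim.x + threadIdx.x) so that each of the N elements is read exactly once across both blocks, and the profiler calls are dropped for brevity. It assumes a device that supports mapped (zero-copy) host memory. Error checking is omitted here to keep the sketch short.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N 512

__global__ void test(int *sum, const int *b) {
    // Global index: an assumption on my part, the original used threadIdx.x only.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        atomicAdd(sum, b[i]);
    }
}

int main() {
    int *sum = nullptr, *d_sum = nullptr, *b = nullptr, *d_b = nullptr;

    // Mapped host allocation for the result, plus its device-side alias.
    cudaHostAlloc((void **)&sum, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_sum, sum, 0);
    *sum = 0;

    // Fill the host array BEFORE copying it to the device.
    b = (int *)malloc(sizeof(int) * N);
    for (int i = 0; i < N; i++) {
        b[i] = 2;
    }

    cudaMalloc((void **)&d_b, sizeof(int) * N);
    cudaMemcpy(d_b, b, sizeof(int) * N, cudaMemcpyHostToDevice);

    test<<<(N + 255) / 256, 256>>>(d_sum, d_b);
    cudaDeviceSynchronize();  // make the device's writes visible before reading *sum

    printf("Sum is: %d\n", *sum);

    cudaFreeHost(sum);  // releases the mapped allocation (d_sum aliases it)
    free(b);
    cudaFree(d_b);
    return 0;
}
```

With every element set to 2 and N = 512, the expected sum is 1024, provided all runtime calls succeed.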

This line is also incorrect:

cudaFree(d_sum);

but it is not central to the problem, which is as njuffa described.
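
To spell out the cudaFree point: d_sum comes from cudaHostGetDevicePointer, so it is only a device-side alias of the cudaHostAlloc allocation, not a separate cudaMalloc buffer. A minimal sketch of the allocate/release pairing follows. The CHECK macro is a hypothetical helper of mine, not something from the original post; wrapping the runtime calls this way should also surface mistakes like this one, since a failed call is reported instead of being silently ignored.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not in the original post): print a message if a
// CUDA runtime call does not return cudaSuccess.
#define CHECK(call)                                        \
    do {                                                   \
        cudaError_t err_ = (call);                         \
        if (err_ != cudaSuccess) {                         \
            fprintf(stderr, "%s failed: %s\n", #call,      \
                    cudaGetErrorString(err_));             \
        }                                                  \
    } while (0)

int main() {
    int *sum = nullptr, *d_sum = nullptr;

    // One allocation, two views of it: host pointer and device pointer.
    CHECK(cudaHostAlloc((void **)&sum, sizeof(int), cudaHostAllocMapped));
    CHECK(cudaHostGetDevicePointer((void **)&d_sum, sum, 0));

    // ... launch kernels that use d_sum here ...

    // Release once, through the host pointer. d_sum needs no cudaFree.
    CHECK(cudaFreeHost(sum));
    return 0;
}
```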