Dear all,
I am seeing incorrect results in a small experiment.
Here are the details:
I am trying to sum the elements of an array.
The array is kept in device memory.
The sum is kept in host memory (a single int allocated with cudaHostAlloc and the cudaHostAllocMapped flag).
Here is the code:
#define N 512

__global__
void test(int *sum, int *b) {
    int i = threadIdx.x;
    if (i < N) {
        atomicAdd(sum, b[i]);
    }
}

int main() {
    cudaDeviceReset();
    int *sum, *d_sum, *b, *d_b;
    cudaHostAlloc((void **)&sum, sizeof(int), cudaHostAllocMapped);
    b = (int *)malloc(sizeof(int) * N);
    cudaMalloc((void **)&d_b, sizeof(int) * N);
    cudaMemcpy(d_b, b, sizeof(int) * N, cudaMemcpyHostToDevice);
    cudaHostGetDevicePointer((void **)&d_sum, sum, 0);
    for (int i = 0; i < N; i++) {
        b[i] = 2;
    }
    *sum = 0;
    cudaDeviceSynchronize();
    cudaProfilerStart();
    test<<<(N + 255) / 256, 256>>>(d_sum, d_b);
    cudaDeviceSynchronize();
    cudaProfilerStop();
    printf("Sum is: %d \n", *sum);
    cudaFreeHost(sum);
    free(b);
    cudaFree(d_sum);
    cudaFree(d_b);
}
Result: the program prints 0, but I expected 1024 (512 elements, each set to 2).
System details:
CUDA : 10.1
Linux kernel : 5.0.0-31-generic
OS : Ubuntu 18.04
Driver : 418.67
Hardware : GeForce RTX 2080
Can someone explain the reason for the discrepancy in the results?
Thanks,
Shweta
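
A possible explanation, offered as a sketch rather than a confirmed diagnosis: in the code above, cudaMemcpy copies b to the device before the loop fills it with 2s. The kernel therefore probably sums whatever the freshly malloc'ed buffer contained, which is often zeros. Two further points look suspicious. The kernel indexes with threadIdx.x only, so both blocks read elements 0..255. Also, d_sum is passed to cudaFree even though it is only the device alias of the mapped host allocation, which cudaFreeHost already releases. The version below reorders the steps under those assumptions. The names sumKernel and sumOnDevice are hypothetical, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N 512

// Each thread adds one element of b into the single accumulator *sum.
__global__ void sumKernel(int *sum, const int *b, int n) {
    // Global index, so that every block covers a different slice of b.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(sum, b[i]);
    }
}

// Hypothetical helper: sums n host integers on the GPU, accumulating into
// a mapped (zero-copy) host integer as in the original post.
int sumOnDevice(const int *b, int n) {
    int *sum = nullptr, *d_sum = nullptr, *d_b = nullptr;

    // Accumulator lives in pinned host memory that the device can address.
    cudaHostAlloc((void **)&sum, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_sum, sum, 0);
    *sum = 0;

    // The input is already initialized by the caller, so copy it now.
    cudaMalloc((void **)&d_b, sizeof(int) * n);
    cudaMemcpy(d_b, b, sizeof(int) * n, cudaMemcpyHostToDevice);

    sumKernel<<<(n + 255) / 256, 256>>>(d_sum, d_b, n);
    cudaDeviceSynchronize();  // wait before reading *sum on the host

    int result = *sum;

    cudaFree(d_b);
    cudaFreeHost(sum);  // also invalidates d_sum; no cudaFree(d_sum) needed
    return result;
}

int main() {
    int *b = (int *)malloc(sizeof(int) * N);
    for (int i = 0; i < N; i++) {
        b[i] = 2;  // initialize BEFORE the host-to-device copy
    }
    printf("Sum is: %d\n", sumOnDevice(b, N));
    free(b);
    return 0;
}
```

If the diagnosis is right and no CUDA call fails, this should print 1024 on the setup described above. Checking the return value of each CUDA call would help rule out other causes.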