I’m new to CUDA and hoping someone can spot my mistake. I’m porting some mathematical code to the GPU, but so far I can’t get a basic test running. Copying to and from device memory works (if I edit the array in host memory and then call cudaMemcpy with cudaMemcpyDeviceToHost, it’s correctly overwritten), but writing to the array on the GPU doesn’t seem to do anything. The kernel is as follows:

```
__global__ void cudatest(double* g_h) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int n = 64;
    if (i < (n+2) && j < (n+2)) {
        g_h[(n+2)*i+j] = 75;
    }
}
```

and is called using

```
dim3 threadsPerBlock(16, 16);
dim3 numBlocks((n+2) / threadsPerBlock.x, (n+2) / threadsPerBlock.y);
cudatest<<<numBlocks, threadsPerBlock>>>(g_h);
cudaMemcpy(h, g_h, (n+2)*(n+2)*sizeof(double), cudaMemcpyDeviceToHost);
mat_out(h, cout);
```
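
A side note on the launch configuration, offered as a sketch rather than a confirmed fix: with n = 64, `(n+2) / threadsPerBlock.x` is integer division, so 66 / 16 gives 4 blocks and the grid spans only 64 threads per dimension. The last two rows and columns would then never be written. That would not by itself explain an output of all ones, but rounding the block count up avoids it. The helper names `blocksFor` and `gridFor` are hypothetical and not part of my code:

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: number of blocks needed to cover `count` elements
// with `perBlock` threads per block, rounding up instead of truncating.
inline unsigned int blocksFor(unsigned int count, unsigned int perBlock) {
    return (count + perBlock - 1) / perBlock;
}

// Hypothetical helper: builds a 2D grid that covers an (n+2) x (n+2) array.
inline dim3 gridFor(int n, dim3 threadsPerBlock) {
    return dim3(blocksFor(n + 2, threadsPerBlock.x),
                blocksFor(n + 2, threadsPerBlock.y));
}
```

With this, `dim3 numBlocks = gridFor(n, threadsPerBlock);` would launch 5 x 5 blocks for n = 64, and the existing bounds check in the kernel keeps the surplus threads from writing outside the array.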

I’m not getting any kind of error, but the output is the array that was initially copied to the device (consisting of all ones), not the array of all 75s that, as far as I can see, it should be. Any help pointing out where I’m going wrong would be hugely appreciated!
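
One caveat on "no error": the code above never asks the CUDA runtime for one, and a kernel launch does not report failures on its own. Below is a minimal sketch of how the launch could be checked, assuming the device buffer `g_h` is already allocated and filled as in my test. The names `cudaOk` and `launchChecked` are hypothetical helpers, while `cudaGetLastError`, `cudaDeviceSynchronize` and `cudaGetErrorString` are standard runtime calls:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints the CUDA error string and returns false on failure.
inline bool cudaOk(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Same kernel as in the question: writes 75 into every in-range element.
__global__ void cudatest(double* g_h) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int n = 64;
    if (i < (n+2) && j < (n+2)) {
        g_h[(n+2)*i+j] = 75;
    }
}

// Hypothetical wrapper: launches the kernel on an already allocated device
// buffer and reports both launch-time and execution-time errors.
bool launchChecked(double* g_h, dim3 numBlocks, dim3 threadsPerBlock) {
    cudatest<<<numBlocks, threadsPerBlock>>>(g_h);
    // Errors in the launch itself (e.g. an invalid configuration) show up here.
    if (!cudaOk(cudaGetLastError(), "kernel launch")) return false;
    // Errors during execution only surface at a synchronizing call.
    return cudaOk(cudaDeviceSynchronize(), "kernel execution");
}
```

If `launchChecked` returns false, the printed message should at least show whether the kernel ran at all. Checking the return value of the `cudaMemcpy` call in the same way would cover the copy back to the host.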