Problem: v1 and v2 are two vectors and dv is a double value; I want to compute v1[i] -= dv * v2[i] for every element i.

This should be quite a simple problem. My CPU version is:

for (i = 0;i < N; i++) v1[i] -= dv * v2[i];

My CUDA version is:

```
__global__ void vec_sub_mul_kernel(double *T, double *S, double V)
{
    T[blockIdx.x*blockDim.x+threadIdx.x] -= S[blockIdx.x*blockDim.x+threadIdx.x] * V;
}

void vec_sub_mul(double *tgt, double *src, double dv, int n)
{
    cudaMemset(d_vx,0,n * sizeof(double));
    cudaMemset(d_vy,0,n * sizeof(double));

    cudaMemcpy(d_vx, src, n *sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_vy, tgt, n *sizeof(double), cudaMemcpyHostToDevice);
    vec_sub_mul_kernel<<<grid_dim,block_dim,block_dim.x*sizeof(double)>>>(d_vy,d_vx,dv);
    cudaMemcpy(tgt,d_vy,n*sizeof(double),cudaMemcpyDeviceToHost);
}
```

These two are supposed to give exactly the same results. However, for some reason they sometimes match and sometimes don't, although the difference is not very big.
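
One thing I have not ruled out, so treat this as a guess: as far as I know, nvcc by default contracts a multiply followed by an add or subtract into a single fused multiply-add (FMA), which rounds only once, whereas a CPU build may round the product first and then subtract. The host-only C++ sketch below mimics both behaviours. The function names and the input values are made up for illustration and are not taken from my real code.

```cpp
#include <cmath>

// Separate rounding: the product is rounded to double before the subtraction.
// The volatile store keeps the compiler from fusing the two operations.
double sub_mul_separate(double a, double dv, double b) {
    volatile double prod = dv * b;
    return a - prod;
}

// Fused: a - dv*b is computed with a single rounding, as an FMA instruction would do.
double sub_mul_fused(double a, double dv, double b) {
    return std::fma(-dv, b, a);
}

// Hypothetical inputs chosen so that the exact product, 1 + 2^-29 + 2^-60,
// does not fit in a double and therefore gets rounded in the separate version.
double example_gap() {
    const double x = 1.0 + std::ldexp(1.0, -30);
    return sub_mul_separate(1.0, x, x) - sub_mul_fused(1.0, x, x);
}
```

With these inputs the two functions differ by 2^-60, which is tiny but nonzero, and that would fit the "sometimes equal, sometimes slightly off" pattern. If this is the cause, I would expect compiling the kernel with nvcc's `-fmad=false` to make the difference go away, but I have not verified that on my setup.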

PS: Is my CUDA code written well for getting the best performance out of the GPU? If not, how can I improve it?