Simply:

I have been trying this for days, and I am thankful for those who helped. However I have not yet figure this out. I am trying to implement a double loop with outside loop over j and inside loop over i. I want for each i, all value of j will be summed then put into this i. However, I could not get this kernel right. I think the most reasonable program structure for what I want to do is:

**global**

void afc_example

(

```
double *d1,*d2;
double *afr,
double *afl,
const double *x,
const double *g,
const int Ns,
const int Nt
```

)

{ /* j: outside loop; i: inside loop */

```
int j = blockIdx.x*blockDim.x + threadIdx.x;
int i;
double art1,art2;
double phs;
double zr, zi;
```

while(j < Ns)

```
{ /* initialize */
```

for(i = Nt-1;i >= 0;i–)

```
{
phs = x[j]*g[i];
zr = cos(phs/2.0);
zi = sin(phs/2.0);
art1 = afr[j]*zr + afl[j]*zi;
art2 = afr[j]*zi - afl[j]*zr;
d1[i] += art1;
d2[i] += art2;
```

} /* i */

j += blockDim.x*gridDim.x;

```
} /* j */
```

}

And here is my atomicAdd():

```
__device__ inline void atomicAdd(double *address, double value) {
unsigned long long oldval, newval, readback;
oldval = __double_as_longlong(*address);
newval = __double_as_longlong(__longlong_as_double(oldval) + value);
while ((readback = atomicCAS((unsigned long long *) address, oldval, newval)) != oldval) {
oldval = readback;
newval = __double_as_longlong(__longlong_as_double(oldval) + value);
}
}
```

It is basically a double loop. I choose doing parallel computing on j, because some variables are updated with each iteration of i. So each ith iteration is related to its previous, (i-1)th, iteration. j is good on parallel computing because j computing is only related to jth location. In j dimension, processing is independent. It is like there is a 2D matrix that I want to fill in. The 2D matrix has size of Ns X Nt. I can fill(process) in each column at same time, which is equivalent to process all j simultaneously for a fixed i. Then I move to next i, do the same thing, processing all j again at the same time. Then go on, after all i are processed, meaning, hitting Nt.

After getting this Ns X Nt matrix, I am adding along j dimension, which is along column direction, from top to bottom. By doing so, finally I will have a 1 X Nt vector, which is d[i]

I think I still need to use atomicAdd(). Just have not figured it out yet.

I highly appreciate any thoughts or pointers. Thanks a lot.