hi Folks:

I am trying to change a .c file to a .cu file. These files are then called by MATLAB. I am following an example. But the index change seems pretty tricky. Here I will post some code (for the sake of simplicity, I deleted some code, and I hope everything is right).

The original .c code, which is a mex file:

```
typedef struct
{ // dimentions:
double *x; // Ns X 1
double *af; // Ns X 1
double *g; // Nt X 1
double *d; // Nt X 1
int Ns;
int Nt;
} simparams;
void *rot_sfc(void *arg)
{
simparams *p = (simparams *)arg;
int i,j;
double aft,ar;
double rot;
double xt;
double *g;
double *d;
double alfa,beta;
/* get pointers */
g = p->g;
d = p->d;
for(j = 0;j < p->Ns;j++)
{ /* initialize */
ar = 1;
aft = p->af[j];
xt = p->x[j];
for(i = p->Nt-1;i >= 0;i--)
{ /* apply rotation */
rot = xt*g[i];
alfa = cos(rot/2.0);
aft = aft*alfa;
ar = ar*z;
d[i] += aft*ar;
beta = sin(rot/2);
ar = ar*beta;
}
}
//pthread_exit(NULL);
}
```

Let me explain a little bit. The d is the returned variable, which is a Nt X 1 vector with all entries 0 created by MATLAB/c mex funtion mexFunction(). This is basically a double loop with outside loop looping through Ns, and inside looping through Nt. For each j, some value is calculated then added to d[i].

Here is the changed version with same return, as a .cu file:

```
__global__
void rot_sfc
( // dimentions:
double *x; // Ns X 1
double *af; // Ns X 1
double *g; // Nt X 1
double *d; // Ns X Nt
int Ns;
int Nt;
)
{
int j = blockIdx.x*blockDim.x + threadIdx.x;
int i;
int idx;
double aft,ar;
double rot;
double xt;
double alfa,beta;
while(j < Ns)
{ /* initialize */
ar = 1;
aft = af[j];
xt = x[j];
for(i = Nt-1;i >= 0;i--)
{
idx = i*Ns + j;
d[idx] = 0;
rot = xt*g[i];
alfa = cos(rot/2.0);
aft = aft*alfa;
ar = ar*z;
d[idx] = aft*ar;
beta = sin(rot/2);
ar = ar*beta;
} /* Nt loop */
j += blockDim.x * gridDim.x;
} /* while j < Ns */
}
```

Similarly, the d is the returned variable, but instead of being a Nt X 1 vector, it is a Ns X Nt matrix now. It is initiated by MATLAB function

```
gpuArray(zeros(Ns,Nt))
```

. So basically it is a Ns X Nt 0 matrix at the beginning. Very importantly: it has to be post processed to get the same result as the previous case. The post processing is summing it up along Ns direction, such that we have a new vector of size Nt X 1, or 1 X Nt:

I am really confused about the mapping in index from .c to .cu file. The double loop in c file,

```
for(j = 0;j < p->Ns;j++)
{
for(i = p->Nt-1;i >= 0;i--)
{
}
}
```

seems clear and straightforward.

But the index in the .cu file:

```
int j = blockIdx.x*blockDim.x + threadIdx.x;
while(j < Ns)
{
for(i = Nt-1;i >= 0;i--)
{
idx = i*Nt_j;
d[idx] = ...
}
j += blockDim.x * gridDim.x;
}
```

But the problem is MATLAB would not allow me to create a Ns X Nt 0 matrix, because it exceeds the max limit. So I cannot do this:

```
gpuArray(zeros(Ns, Nt))
```

So I did this to get d:

```
__global__
void rot_sfc
( // dimentions:
double *x; // Ns X 1
double *af; // Ns X 1
double *g; // Nt X 1
double *d; // Ns X Nt
int Ns;
int Nt;
)
{
int j = blockIdx.x*blockDim.x + threadIdx.x;
int i;
double aft,ar;
double rot;
double xt;
double alfa,beta;
while(j < Ns)
{ /* initialize */
ar = 1;
aft = af[j];
xt = x[j];
for(i = Nt-1;i >= 0;i--)
{
rot = xt*g[i];
alfa = cos(rot/2.0);
aft = aft*alfa;
ar = ar*z;
d[i] += aft*ar;
beta = sin(rot/2);
ar = ar*beta;
} /* Nt loop */
j += blockDim.x * gridDim.x;
} /* while j < Ns */
}
```

But the result is NOT right. I don’t know why. Perhaps the index messed up the whole thing?

seems pretty confusing. I mean, if j is not consecutive over natural integer number, doesn’t skip some locations in a Ns X Nt matrix?

I understand “blockIdx.x*blockDim.x + threadIdx.x” has something to do with how threads is formed in block in grid in GPU, but I just cannot figure out how these indices are mapped.

Anyone thoughts is highly appreciated.