index mapping from .c file to .cu file

hi Folks:

I am trying to change a .c file to a .cu file. These files are then called by MATLAB. I am following an example. But the index change seems pretty tricky. Here I will post some code (for the sake of simplicity, I deleted some code, and I hope everything is right).

The original .c code, which is a mex file:

typedef struct 

{				// dimentions:

	double *x; 	// Ns X 1

	double *af;	// Ns X 1

	double *g;	// Nt X 1

	double *d;	// Nt X 1

	int Ns;

	int Nt;

} simparams;

void *rot_sfc(void *arg)

{

	simparams *p = (simparams *)arg;

	int i,j;

	double aft,ar;

	double rot;

	double xt;

	double *g;

	double *d;

	double alfa,beta;

	/* get pointers */

	g = p->g;

	d = p->d;

for(j = 0;j < p->Ns;j++)

	{	/* initialize */

		ar = 1;

		aft = p->af[j];

		xt = p->x[j];

		for(i = p->Nt-1;i >= 0;i--)

		{	/* apply rotation */

			rot = xt*g[i];

			alfa = cos(rot/2.0);

			aft = aft*alfa;

			ar = ar*z;

			d[i] += aft*ar;			

			beta = sin(rot/2);			

			ar = ar*beta;

		}

	}

	//pthread_exit(NULL);

}

Let me explain a little bit. The d is the returned variable, which is a Nt X 1 vector with all entries 0 created by MATLAB/c mex funtion mexFunction(). This is basically a double loop with outside loop looping through Ns, and inside looping through Nt. For each j, some value is calculated then added to d[i].

Here is the changed version with same return, as a .cu file:

__global__

void rot_sfc

(				// dimentions:

	double *x; 	// Ns X 1

	double *af;	// Ns X 1

	double *g;	// Nt X 1

	double *d;	// Ns X Nt

	int Ns;

	int Nt;

)

{

	int j = blockIdx.x*blockDim.x + threadIdx.x;

	int i;

	int idx;

  	double aft,ar;

	double rot;

	double xt;

	double alfa,beta;

	while(j < Ns)

	{	/* initialize */	

		ar = 1;

		aft = af[j];

		xt = x[j];

		

		for(i = Nt-1;i >= 0;i--)

		{	

			idx = i*Ns + j;		

			d[idx] = 0;

			rot = xt*g[i];

			alfa = cos(rot/2.0);

			aft = aft*alfa;

			ar = ar*z;

			d[idx] = aft*ar;			

			beta = sin(rot/2);			

			ar = ar*beta;

		} /* Nt loop */

		j += blockDim.x * gridDim.x;    

	} /* while j < Ns */      

}

Similarly, the d is the returned variable, but instead of being a Nt X 1 vector, it is a Ns X Nt matrix now. It is initiated by MATLAB function

gpuArray(zeros(Ns,Nt))

. So basically it is a Ns X Nt 0 matrix at the beginning. Very importantly: it has to be post processed to get the same result as the previous case. The post processing is summing it up along Ns direction, such that we have a new vector of size Nt X 1, or 1 X Nt:

I am really confused about the mapping in index from .c to .cu file. The double loop in c file,

for(j = 0;j < p->Ns;j++)

{ 

    for(i = p->Nt-1;i >= 0;i--)

    {

    }

}

seems clear and straightforward.

But the index in the .cu file:

int j = blockIdx.x*blockDim.x + threadIdx.x;

while(j < Ns)

	{	

		for(i = Nt-1;i >= 0;i--)

		{

	            idx = i*Nt_j;

                    d[idx] = ...

	        } 

		j += blockDim.x * gridDim.x;    

	}

But the problem is MATLAB would not allow me to create a Ns X Nt 0 matrix, because it exceeds the max limit. So I cannot do this:

gpuArray(zeros(Ns, Nt))

So I did this to get d:

__global__

void rot_sfc

(				// dimentions:

	double *x; 	// Ns X 1

	double *af;	// Ns X 1

	double *g;	// Nt X 1

	double *d;	// Ns X Nt

	int Ns;

	int Nt;

)

{

	int j = blockIdx.x*blockDim.x + threadIdx.x;

	int i;

  	double aft,ar;

	double rot;

	double xt;

	double alfa,beta;

	while(j < Ns)

	{	/* initialize */	

		ar = 1;

		aft = af[j];

		xt = x[j];

		

		for(i = Nt-1;i >= 0;i--)

		{	

			rot = xt*g[i];

			alfa = cos(rot/2.0);

			aft = aft*alfa;

			ar = ar*z;

			d[i] += aft*ar;			

			beta = sin(rot/2);			

			ar = ar*beta;

		} /* Nt loop */

		j += blockDim.x * gridDim.x;    

	} /* while j < Ns */ 

}

But the result is NOT right. I don’t know why. Perhaps the index messed up the whole thing?

seems pretty confusing. I mean, if j is not consecutive over natural integer number, doesn’t skip some locations in a Ns X Nt matrix?

I understand “blockIdx.x*blockDim.x + threadIdx.x” has something to do with how threads is formed in block in grid in GPU, but I just cannot figure out how these indices are mapped.

Anyone thoughts is highly appreciated.