Help using CUDA for MRI gridding

Hi,
I am trying to implement MRI gridding using CUDA. The algorithm interpolates data sampled on a spiral trajectory onto a Cartesian grid. I have designed the algorithm as an input-driven assignment, meaning the thread IDs are assigned to the input spiral data. The input spiral data is a 1 x N vector of complex numbers, and the grid is K x K, which is much smaller than the input. In my case the input data is of length 1 x 12712 and the grid is 64 x 64. I then return the result to MATLAB through a MEX function. Here are some snippets of the code:
Assignment of the grid dimensions on the device :
dim3 dimBlock(256);
dim3 dimGrid((1*len)/dimBlock.x);
if((1*len)%256 != 0) dimGrid.x += 1;

here len = 12712.
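
As far as I understand, the same launch configuration can also be written with the usual ceiling-division idiom, which avoids the separate remainder check:

[codebox]
dim3 dimBlock(256);
// round up so that all len input samples are covered by the grid
dim3 dimGrid((len + dimBlock.x - 1) / dimBlock.x);
[/codebox]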

calling the device:
interpolation<<<dimGrid,dimBlock>>>(devicein_r, devicein_i, K, N, J, kx1, ky1, deviceout_r, deviceout_i, fn, Ofactor, len, dcf);

// returning the data from the device after interpolation:
__global__ void interpolation(float *devicein_r, float *devicein_i, int K, int N,
    int J, float *kx1, float *ky1, float *deviceout_r, float *deviceout_i,
    float *fn, int Ofactor, int len, float *dcf)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;

    // initializing the device output to zero
    if(idx < K*K)
    {
        *(deviceout_r + idx) = 0.0f;
        *(deviceout_i + idx) = 0.0f;
    }

    for (x = minx; x <= maxx; x++)
    {
        ..... (interpolation code)

        interp = interpx*interpy;
        temp_cord = (xtemp+1) + (ytemp*K) - 1;
        *(deviceout_r + temp_cord) += *(devicein_r + idx)*interp*(*(dcf+idx));
        *(deviceout_i + temp_cord) += *(devicein_i + idx)*interp*(*(dcf+idx));
    }
}

Now when I try to run this, I find that once the data set grows beyond about 5000 samples the output values all come out as zero, even though the thread IDs are supposed to run up to len. I am not sure whether the problem is with the thread assignment or with how the device output array is accessed. I am having serious problems with this.

Also, I know that the maximum number of simultaneously active threads on a GeForce 8800 is 12288, but in the above case it should work at least up to that many thread assignments, right? And how do I handle the elements that go beyond that number? I mean, will a simple for loop inside the kernel, rather than one thread per idx, work on the device? Please help.
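
For instance, is something like this grid-stride loop a valid pattern on the device? (Just a sketch; process_sample is a hypothetical stand-in for the per-sample interpolation body.)

[codebox]
// Sketch of a grid-stride loop: each thread handles several input
// samples, so the grid no longer needs one thread per sample.
__global__ void interpolation_strided(float *devicein_r, float *devicein_i,
                                      float *deviceout_r, float *deviceout_i,
                                      int len)
{
    int stride = blockDim.x * gridDim.x;   // total number of threads launched
    for (int i = blockIdx.x*blockDim.x + threadIdx.x; i < len; i += stride)
    {
        // process_sample(i);  // hypothetical: the per-sample interpolation body
    }
}
[/codebox]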

Regards
Kashyap

You will find the following paper of interest:
“Accelerating the Nonequispaced Fast Fourier Transform on Commodity Graphics Hardware” (IEEE Xplore): http://ieeexplore.ieee.org/xpl/freeabs_all...18&index=11

/ Thomas

Hi Thomas,

I have gone through the paper. My problem is more programming-based. In particular, I am not sure how to unroll the loops in the interpolation code. My code has a lot of divergent if-else statements, so the threads have to wait until those are done. But optimization is not my main problem right now; I need to get this code working first, then I can think about optimizing. As of now, all the reads are from global memory. Another problem, I think, along with the divergent statements, is the depth of the looping statements. Here is the interpolation part of the code that I have written. It reads an input location from k-space; the interpolation kernel has been pre-computed. I find all the grid coordinates affected by the interpolation (in my case the length of the interpolation kernel J is 6), then multiply the interpolation kernel with the input values at those coordinates. Most of the time I get 0 as the output of the code, and commenting out snippets of the code to figure out where I am going wrong has not helped either.

[codebox]
dim3 dimBlock(256);
dim3 dimGrid((1*len)/dimBlock.x);
if((1*len)%256 != 0) dimGrid.x += 1;

// here len = 12712.

// calling the device:
interpolation<<<dimGrid,dimBlock>>>(devicein_r, devicein_i, K, N, J, kx1, ky1,
                                    deviceout_r, deviceout_i, fn, Ofactor, len, dcf);

// returning the data from the device after interpolation:
__global__ void interpolation(float *devicein_r, float *devicein_i, int K, int N,
                              int J, float *kx1, float *ky1,
                              float *deviceout_r, float *deviceout_i,
                              float *fn, int Ofactor, int len, float *dcf,
                              float *output1)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;

    float e;                        // temporary; holds Ofactor, the length of the interpolation function
    float outx, outy;
    float kxvalue, kyvalue;         // input coordinates from k-space
    float minx, maxx, miny, maxy;   // limits of the grid range that a spiral k-space value can affect
    float x, y;
    int   floorx, floory;
    float resx, resy;
    float interpx, interpy, interp; // interpolation weights
    int   xtemp, ytemp, temp_cord;
    float xvalue, yvalue;           // coordinates of the output on the Cartesian grid

    if(idx < len)
    {
        kxvalue = *(kx1 + idx);
        kyvalue = *(ky1 + idx);

        minx = floor(kxvalue - (float)J/2);
        maxx = ceil(kxvalue + (float)J/2);
        miny = floor(kyvalue - (float)J/2);
        maxy = ceil(kyvalue + (float)J/2);

        for (x = minx; x <= maxx; x++)
        {
            xvalue = fabs(kxvalue - x);
            e = Ofactor;
            if(xvalue > (float)J/2 + 1/e)
                interpx = 0;
            else
            {
                floorx = floor(xvalue*Ofactor);
                e = floorx;
                resx = (xvalue - e/(float)Ofactor)*(float)Ofactor;
                interpx = *(fn+floorx)*(1-resx) + *(fn+floorx+1)*resx;
            }

            xtemp = x;
            if(xtemp < 0)       xtemp = xtemp + K;
            else if(xtemp >= K) xtemp = xtemp - K;

            for (y = miny; y <= maxy; y++)
            {
                yvalue = fabs(kyvalue - y);
                if(yvalue > (float)J/2 + 1/(float)Ofactor)
                    interpy = 0;
                else
                {
                    floory = floor(yvalue*Ofactor);
                    e = floory;
                    resy = (yvalue - (float)floory/(float)Ofactor)*(float)Ofactor;
                    interpy = *(fn+floory)*(1-resy) + *(fn+floory+1)*resy;
                }

                ytemp = y;
                if(ytemp < 0)       ytemp = ytemp + K;
                else if(ytemp >= K) ytemp = ytemp - K;

                interp = interpx*interpy;

                // defining the temporary coordinate and the outputs
                temp_cord = (xtemp+1) + (ytemp*K - 1);
                outx = *(devicein_r + idx) * interp;
                outy = *(devicein_i + idx) * interp;

                *(deviceout_r + temp_cord) = outx;
                *(deviceout_i + temp_cord) = outy;
            }
        }
    }
}
[/codebox]

On comparing with the CPU-based gridding, this version does not seem to take all the values into account when computing the result. Any suggestions to make this work are most welcome. Here K is the size of the grid.

First, you should use the code tags to paste code, especially code this long.
Second, indent it, because as posted it is not easy to read!

Now, there are a couple of variables I can't see the definition of. What is ‘e’?
You're using J/2 everywhere (as far as I could spot); why not pass that as a parameter instead of (slowly) dividing it every time?

Now, on to the problem at hand. Have you checked for errors when running the kernel, using CUDA_SAFE_CALL or something like that? If you get 0 as the output, perhaps the kernel is not even executing, and from there you can check for out-of-bounds memory accesses.
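
For instance, something along these lines right after the launch (a minimal sketch using the runtime API; mexPrintf assumes you call it from your MEX host code):

[codebox]
interpolation<<<dimGrid,dimBlock>>>(devicein_r, devicein_i, K, N, J, kx1, ky1,
                                    deviceout_r, deviceout_i, fn, Ofactor, len, dcf);
// kernel launches are asynchronous: synchronize first so any execution
// error is actually reported here, then query the error state
cudaThreadSynchronize();
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    mexPrintf("kernel failed: %s\n", cudaGetErrorString(err));
[/codebox]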

Perhaps also add a check that temp_cord stays below the allocated length of deviceout_r and deviceout_i, to make sure you're not writing out of bounds.

Have you run it in emudebug mode? Does it return something other than 0s?

Hi,

Sorry about the code; I have edited it and tried to make it as clear as possible. I did use CUT_CHECK_ERROR after calling the kernel, but the program returned no error when I compiled it, and I still got 0 as the result. I was not able to use emudebug because this is actually a MATLAB MEX file, so when I try using just the nvcc compiler it does not recognize the MEX format. Also, temp_cord is calculated to access all the points on the grid, so it does not go out of bounds: it just accesses that particular coordinate on the grid and adds the result of the input multiplied by the computed interpolation kernel. It is an accumulator array, so the same value of temp_cord gets repeated for different values of the input.

I am very new to programming on NVIDIA hardware, but I will make a suggestion anyway. Insert suitable output statements into the code so that you can verify things are working as you intend; these are sometimes referred to as “debug” statements. It usually doesn't work to just “read” the code to infer what it is doing: you must actually run the code and look at the intermediate results to verify that all is well.
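
On the device itself you cannot simply print from a kernel (outside emulation mode), but the same idea works by writing intermediate values into a global "debug" buffer and copying it back. A sketch, where debug_buf is a hypothetical float array of length len allocated on the device with cudaMalloc:

[codebox]
// Inside the kernel, after computing the value you want to inspect:
//     debug_buf[idx] = (float)temp_cord;
// Host side, after the kernel completes:
float *h_debug = (float*)malloc(len*sizeof(float));
cudaMemcpy(h_debug, debug_buf, len*sizeof(float), cudaMemcpyDeviceToHost);
for (int i = 0; i < 10; i++)
    printf("thread %d: temp_cord = %f\n", i, h_debug[i]);
free(h_debug);
[/codebox]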

Good luck, and let us know what you find out.

Hi Kashyap,

I think emulation mode is a must for this kind of issue. If you single-step through the kernel and look at all the intermediate results, you will figure out what is going on.

However, be warned that in our paper we advise against the “one thread per sample” approach you are using. You need to take special care to avoid write conflicts if/when several threads attempt to write to one grid location concurrently, and this issue will not show up in emulation mode, only when running on the device. Instead, we suggest starting a thread for each grid location and using a precomputed data structure to assign samples to each grid position. In the paper this is generalized to input domains, which you can implement using shared memory.
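
As a rough illustration of this gather approach (only a sketch: sample_start and sample_idx are a hypothetical precomputed CSR-style lookup mapping each grid cell to the samples that influence it, and the interpolation weight is left as a placeholder):

[codebox]
// Output-driven ("gather") gridding: one thread per grid cell, so each
// output location is written by exactly one thread and no write
// conflicts can occur.
__global__ void gridding_gather(float *in_r, float *in_i,
                                float *out_r, float *out_i,
                                int *sample_start, int *sample_idx,
                                float *kx1, float *ky1, int K)
{
    int cell = blockIdx.x*blockDim.x + threadIdx.x;
    if (cell >= K*K) return;

    float acc_r = 0.0f, acc_i = 0.0f;
    // samples influencing this cell: sample_idx[sample_start[cell] .. sample_start[cell+1]-1]
    for (int s = sample_start[cell]; s < sample_start[cell+1]; s++)
    {
        int i = sample_idx[s];
        float w = 1.0f;   // placeholder: evaluate the precomputed interpolation kernel at (kx1[i], ky1[i])
        acc_r += in_r[i]*w;
        acc_i += in_i[i]*w;
    }
    out_r[cell] = acc_r;
    out_i[cell] = acc_i;
}
[/codebox]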

/ Thomas

Hi Thomas,

Sorry for the late reply. I did set up emulation mode and ran the code in it, and it worked fine, as expected. But when I run it again on the GPU it just gives 0 as the answer. I still don't know why; I am lost for answers. The code is still the same as I posted earlier in this discussion. I hope someone can tell me where I am going wrong…

Regards,

Kashyap

The only time I've got 0 as the output was when the code was not actually run. You mention you've used CUT_CHECK_ERROR, but was the code compiled in Release mode? If so, the macro gets ignored.

If it works fine in emudebug, you've got something like a race condition, out-of-bounds array indexing, or an invalid execution configuration (sizes of blocks and grid), something along those lines.
I'm not familiar enough with the MATLAB package to help you more than that. Good luck!

Yes, when executing on the device it ran in Release mode; perhaps that is why I did not get any errors. Well, I will look into these possible aspects and see where I am going wrong. I also wanted to know whether the depth of the loops is affecting the program: the code has nested for loops and multiple diverging if-else statements. The literature says this will slow down the computation, but is there any other effect? I mean, can a thread handle such nested loops at all? Also, the program has multiple writes to the global memory array, so different threads may write to the same memory location at the same time. Could that also be a factor?

Regards,

Kashyap