Code optimization for register usage reduction (CUDA optimization)

Greetings to all… This is my first post and I don’t know how naive my questions may be, so please bear with me. I’ve been using CUDA to implement a series of algorithms, most of which consist of many nested for loops. I’m looking for techniques to reduce the number of registers each thread uses, so that I can raise multiprocessor (MP) occupancy and thereby cut computation time.
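To make the link between registers and occupancy concrete, here is a rough sketch of the arithmetic. The helper name `regLimitedBlocksPerMP` is made up for illustration, and the 8192-register multiprocessor and 256-thread block used in the note below are assumed figures, not values taken from my setup. The calculation also ignores register allocation granularity, shared memory, and the hardware limit on resident blocks, so treat it as an upper bound only.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: upper bound on resident blocks per multiprocessor
// imposed by register pressure alone. Ignores allocation granularity,
// shared memory, and the hardware cap on resident blocks.
__host__ __device__ inline int regLimitedBlocksPerMP(int regsPerMP,
                                                     int regsPerThread,
                                                     int threadsPerBlock)
{
    // Each resident block needs regsPerThread * threadsPerBlock registers.
    return regsPerMP / (regsPerThread * threadsPerBlock);
}
```

Under those assumed figures, 20 registers per thread allows 8192 / (20 * 256) = 1 resident block, while 16 registers per thread allows 8192 / (16 * 256) = 2, which is why shaving off even a few registers can matter.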

Here is my code:

int pixels = image_Rows;
int bands = image_Cols;
int endmembers = endmembers_Rows;

/* Dimensions of Grid */
int bx = blockIdx.x;
int by = blockIdx.y;

/* Dimension of Block */
int tx = threadIdx.x;
int ty = threadIdx.y;

/* Indicates the current thread number */
int threadpixelnum = dimXYofGrid*blocksize*by + blocksize*bx + threadblockrowsize*ty + tx;

float numerator = 0.0f;
float denominator = 0.0f;

float dot;
int m;
int i;
int s;
int k;

for (k = 0; k < iterations; k++) {
    for (m = 0; m < endmembers; m++) {
        for (i = 0; i < bands; i++) {
            dot = 0.0f;
            for (s = 0; s < endmembers; s++) {
                dot += endmemberD[endmembers*i + s] * abundanceD[pixels*s + threadpixelnum];
            }
            numerator += (endmemberD[endmembers*i + m] / dot) * imageD[pixels*i + threadpixelnum];
            denominator += endmemberD[endmembers*i + m];
        }
        abundanceD[pixels*m + threadpixelnum] = abundanceD[pixels*m + threadpixelnum] * (numerator / denominator);
        denominator = 0.0f;
    }
}



I don’t know how readable it is, but that’s how it goes.

Currently this implementation uses 20 registers and 64 bytes of shared memory…
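For reference, figures like these can also be read back at run time with `cudaFuncGetAttributes`. The sketch below assumes a stand-in kernel named `exampleKernel` and a helper named `printKernelResources`; both names are hypothetical, and the real kernel symbol would go in their place.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; substitute the real kernel symbol.
__global__ void exampleKernel(float* data)
{
    data[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;
}

// Prints the per-thread register count and the static shared memory size
// of exampleKernel. Returns the CUDA status so the caller can tell whether
// the query succeeded (it fails, for instance, when no device is present).
cudaError_t printKernelResources()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, exampleKernel);
    if (err != cudaSuccess) {
        std::printf("query failed: %s\n", cudaGetErrorString(err));
        return err;
    }
    std::printf("registers per thread: %d\n", attr.numRegs);
    std::printf("static shared memory: %zu bytes\n", attr.sharedSizeBytes);
    return cudaSuccess;
}
```

The same kind of numbers are printed at compile time when nvcc is run with `-Xptxas -v`, which is handy for comparing variants of a kernel without running them.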

I could use some pointers on how to optimize this, if it can be optimized at all…
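One direction that might be worth trying on the loop nest above is sketched here; it is not a measured improvement. It narrows variable scopes, loads each endmember entry once per band, marks the arrays `__restrict__`, and states a block-size bound with `__launch_bounds__`. Several things are assumptions on my part: the names `updatePixel` and `unmixKernel`, a 1D launch with one thread per pixel, the bound of 256 threads, that the three arrays do not overlap, and that `numerator` should be reset for every endmember (the posted code resets only `denominator`). Whether the register count actually drops has to be checked against the compiler's report.

```cuda
#include <cuda_runtime.h>

// One update pass for pixel p. Layout as in the posted code:
// endmemberD is bands x endmembers (row-major), abundanceD is
// endmembers x pixels, imageD is bands x pixels.
// Assumption: the three arrays do not alias, so __restrict__ is valid.
__host__ __device__ inline void updatePixel(const float* __restrict__ endmemberD,
                                            const float* __restrict__ imageD,
                                            float* __restrict__ abundanceD,
                                            int pixels, int bands,
                                            int endmembers, int p)
{
    for (int m = 0; m < endmembers; ++m) {
        // Assumption: both accumulators restart for each endmember.
        float numerator = 0.0f;
        float denominator = 0.0f;
        for (int i = 0; i < bands; ++i) {
            const float* eRow = endmemberD + endmembers * i;  // row for band i
            float dot = 0.0f;
            for (int s = 0; s < endmembers; ++s)
                dot += eRow[s] * abundanceD[pixels * s + p];
            const float e = eRow[m];  // loaded once, used twice
            numerator += (e / dot) * imageD[pixels * i + p];
            denominator += e;
        }
        abundanceD[pixels * m + p] *= numerator / denominator;
    }
}

// Assumed 1D launch, one thread per pixel, at most 256 threads per block.
__global__ void __launch_bounds__(256)
unmixKernel(const float* __restrict__ endmemberD,
            const float* __restrict__ imageD,
            float* __restrict__ abundanceD,
            int pixels, int bands, int endmembers, int iterations)
{
    const int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= pixels) return;  // guard the last, partially filled block
    for (int k = 0; k < iterations; ++k)
        updatePixel(endmemberD, imageD, abundanceD, pixels, bands, endmembers, p);
}
```

Putting the per-pixel work in a `__host__ __device__` function means the arithmetic can be checked on the CPU with a tiny input before any launch. If the compiler still uses too many registers, the nvcc option `--maxrregcount` can cap them, at the possible cost of spills to local memory.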

Thank you