Greetings to all. This is my first post and I don't know how naive my questions may be, so please bear with me. I've been using CUDA to implement a series of algorithms, most of which consist of several nested for loops. I'm looking for techniques to reduce per-thread register usage, so that I can increase multiprocessor (MP) occupancy and thereby decrease computation time.

Here is my code:

```
int pixels = image_Rows;
int bands = image_Cols;
int endmembers = endmembers_Rows;

/* Block index within the grid */
int bx = blockIdx.x;
int by = blockIdx.y;

/* Thread index within the block */
int tx = threadIdx.x;
int ty = threadIdx.y;

/* Global index of the pixel handled by this thread */
int threadpixelnum = dimXYofGrid*blocksize*by + blocksize*bx + threadblockrowsize*ty + tx;

float numerator = 0.0f;
float denominator = 0.0f;
float dot;
int m, i, s, k;

__syncthreads();

if (threadpixelnum < pixels)
{
    for (k = 0; k < iterations; k++)
    {
        for (m = 0; m < endmembers; m++)
        {
            for (i = 0; i < bands; i++)
            {
                /* Dot product of endmember row i with this pixel's abundances */
                dot = 0.0f;
                for (s = 0; s < endmembers; s++)
                    dot += endmemberD[endmembers*i + s] * abundanceD[pixels*s + threadpixelnum];
                numerator += (endmemberD[endmembers*i + m] / dot) * imageD[pixels*i + threadpixelnum];
                denominator += endmemberD[endmembers*i + m];
            }
            /* Multiplicative update of abundance m for this pixel */
            abundanceD[pixels*m + threadpixelnum] = abundanceD[pixels*m + threadpixelnum] * (numerator / denominator);
            denominator = 0.0f;
            numerator = 0.0f;
        }
    }
}

__syncthreads();
```

I don't know how readable that is, but that's how it goes.

Currently this implementation uses 20 registers per thread and 64 bytes of shared memory.
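Numbers like these can be checked from the host as well as from the compiler output (for example `nvcc --ptxas-options=-v`). The following is only a minimal sketch, assuming a CUDA toolkit recent enough to provide the occupancy API; `placeholderKernel` and `reportOccupancy` are hypothetical names, and the stand-in kernel would need to be swapped for the real one to get its actual figures.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; substitute the real kernel to see its numbers.
__global__ void placeholderKernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] *= 2.0f;
}

// Prints the kernel's register and static shared memory usage, then returns
// the theoretical occupancy (resident threads / maximum threads per
// multiprocessor) for the given block size on device 0.
// Returns -1.0f if any CUDA call fails.
float reportOccupancy(int threadsPerBlock)
{
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, placeholderKernel) != cudaSuccess)
        return -1.0f;
    printf("registers per thread: %d\n", attr.numRegs);
    printf("static shared memory: %zu bytes\n", attr.sharedSizeBytes);

    // Number of blocks of this size that can be resident on one multiprocessor.
    int blocksPerSM = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, placeholderKernel, threadsPerBlock, 0) != cudaSuccess)
        return -1.0f;

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
        return -1.0f;

    return (float)(blocksPerSM * threadsPerBlock) /
           (float)prop.maxThreadsPerMultiProcessor;
}
```

Calling `reportOccupancy(256)` from `main` (the block size of 256 is an assumption) shows whether registers are actually the limiting factor before spending effort on reducing them.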

I could use some pointers on how to optimize this, if it can be optimized at all.
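One possible direction, offered only as a sketch: in the loop above, `denominator` is the sum of `endmemberD[endmembers*i + m]` over all bands, which depends on neither the pixel nor the iteration, so it could be computed once per endmember before the launch. The version below assumes a 1D launch and introduces hypothetical names (`colSumD`, `computeColumnSums`, `updateAbundances`) that are not part of the original code. Whether it lowers the register count depends on the compiler and the hardware; it mainly removes one accumulator and some redundant global reads.

```cuda
#include <cuda_runtime.h>

// Host helper (hypothetical): colSum[m] = sum over bands i of
// endmember[endmembers*i + m], accumulated in the same order as the
// original inner loop.
void computeColumnSums(const float *endmember, int bands, int endmembers,
                       float *colSum)
{
    for (int m = 0; m < endmembers; m++) {
        float sum = 0.0f;
        for (int i = 0; i < bands; i++)
            sum += endmember[endmembers * i + m];
        colSum[m] = sum;
    }
}

// Sketch of the update kernel with the denominator precomputed.
// Assumes a 1D grid of 1D blocks and that colSumD was filled from
// computeColumnSums and copied to the device beforehand.
__global__ void updateAbundances(const float *__restrict__ imageD,
                                 const float *__restrict__ endmemberD,
                                 const float *__restrict__ colSumD,
                                 float *abundanceD,
                                 int pixels, int bands, int endmembers,
                                 int iterations)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= pixels)
        return;

    for (int k = 0; k < iterations; k++) {
        for (int m = 0; m < endmembers; m++) {
            float numerator = 0.0f;
            for (int i = 0; i < bands; i++) {
                const float *row = endmemberD + endmembers * i;
                // Same dot product as before; it still sees abundances
                // already updated for smaller m in this iteration.
                float dot = 0.0f;
                for (int s = 0; s < endmembers; s++)
                    dot += row[s] * abundanceD[pixels * s + p];
                numerator += (row[m] / dot) * imageD[pixels * i + p];
            }
            abundanceD[pixels * m + p] *= numerator / colSumD[m];
        }
    }
}
```

The update order is kept as in the original, so each abundance is still computed from the values written earlier in the same iteration. Register usage can also be capped directly with `__launch_bounds__` or the `nvcc` option `-maxrregcount`, at the risk of spilling to local memory.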

Thank you.