How to force array not to be allocated in local memory?

My kernel needs 24 floats for each threads, so I declared them as an array to ease coding. Otherwise, coding could become very very exhausting. However, compiler keeps allocate this array in local memory, causing lots of local mem access. I am sure that there are enough register to hold all these values. Is there a way to force the comiler not to use local memory?

P.S. all offsets to access this array is known at compile time.

Instead of using a variable to access individual components … explicitly address them as A[1] = data , A[2] = data2 where A is the array you are talking about. That should help the compiler put that array in register. Btw an Array of 24 floats will be hard to access explicityl liek that, I suggest using multiple smaller arrays.

Thanks! I broke it into some small arrays, and it worked!

What do you mean by “all offsets to access this array is known at compile time”?

you could do something like, example code:

__global__ static void myKernel(float* input, float* out)

{

float reg_vals[24];

#pragma unroll

for(int i = 0; i < 24; i++)

reg_vals[i] = in[...... + treadIdx.x + i*something];

// do something

#pragma unroll

for(int i = 0; i < 24; i++)

reg_vals[i] += 1;

// write back

#pragma unroll

for(int i = 0; i < 24; i++)

out [....]  = reg_vals[i];

}

You should also make sure that your maxrregcount is set high enough.

merry christmas!

To Kiran_CUDA: It simply means there is no indirect global memory access.

To Jimmy: Actually, that’s how I wrote it. The problem is there is texture load and __synchtreads() in the loop body, so the compiler wouldn’t unroll the body. Anyway I manually unrolled it. I didn’t try it, but I think maybe whether unrolling or not won’t affect the decision of where to put the arrays. Nitin.life’s suggestion is good. And high maxrregcount seems have no effect either. Thanks!

P.S. Maybe nvcc should try to unroll the loop when there is textrue load or __syncthreads() inside the body, which will reduce programers’ work a lot!

I’ve been able to unroll loops with __syncthreads() before without a problem. Maybe your problem lies with the texture loads? Tigga once told me he had trouble unrolling when these calls were involved…

I apologize for posting a question on this thread already open, but since is a similar topic I wish that could be usefull to continue here instead of open a new thread.

I would like to put a 16 element array in the registers without the use of local memory, also in my case I dont saturate the register space.

My code is like that:

[codebox]

int pre_histor[16];

for(int ipm=0;ipm<PM_X_THREAD;ipm++) {

 for(int i=0;i<16;i++) {pre_histor[i]=0;}

 float disx = sqrt((xhit-xcenterx)*(xhit-xcenterx)+(yhit-ycenterx)*(yhit-ycenterx));

 int ndisx = ceilf(disx/((float)(H_MAX-H_MIN)/NUM_BINS));

 if(ndisx<16) pre_histor[ndisx]++;

}

[/codebox]

with just this code I haven’t problem, everithing stay in the registers. But if i put something like

maxhx = pre_histor[4];

or something more complicated like

#pragma unroll

for (int i=0;i<16;i++) {

   if (pre_histor[i]>maxhx) {                                                                                                       

       maxhx = pre_histor[i];

       maxnx = i;                                                                                                                        

       }

  }

I have migration to local mem:

  Used 22 registers, 64+0 bytes lmem, 12704+16 bytes smem, 8 bytes cmem[0], 80 bytes cmem[1]                

Could you me explain why and in which way i can avoid this? I tried to split the 16 elements array in two 8 elements arrays obtaining exactly the same behaviour…

Thx in advance,

g.

another information:

the variables maxnx and maxhx are related to the kernel output;

if instead of the last piece of code above there is:

int aa;

#pragma unroll

for (int i=0;i<16;i++) {

   if (pre_histor[i]>maxhx) {                                                                                                       

       aa = pre_histor[i];

       // maxnx = i;                                                                                                                        

       }

  }

where aa is just a local variable, I don’t have any local memory migration.

There is a way to force the compiler in this case?

g.

The problem is that the compiler is not very good at figureing out indexing of arrays. If you can substitute your array with registers or if you can manually unroll that “for” loop then the array will go into registers else …

try

loop starts

if (pre_histor[i]>maxhx) { 

		   aa = pre_histor[i];

		   }

else {aa = 0.0f;}

if(aa>0.0f) maxnx = i;

loop ends

provided aa is always >0.0f

i wouldn’t say the compiler isn’t very good, it’s just that it doesn’t know “maxhx” at compile time. If you want to unroll something all variables must be known to the compiler…

I tried your recipe, but, unfortunally, doesn’t work.

In any case I think that the problem is not related (directly) to the loop and the unrolling:

indeed with

if (pre_histor[5]>maxhx) {

	   aar = pre_histor[5]; }

	 else {aar=0.0f;}

	 if (aar>0.0f) maxnx = 5;

out of the for-loop the migration is still there.

The simple statement (without for-loops)

maxnx = pre_histor[5];

produces: Used 12 registers, 64+0 bytes lmem, 12704+16 bytes smem, 8 bytes cmem[0], 28 bytes cmem[1]

while

aar = pre_histor[5];

produces: Used 8 registers, 12704+16 bytes smem, 8 bytes cmem[0], 20 bytes cmem[1]

the only difference is that aar is a local-dummy variable while the maxnx variable is directly related to the kernel output.

Do you have any idea?

g.

Could you post your code?

I think that the problem is following code

int pre_histor[16];

.....

for(int ipm=0;ipm<PM_X_THREAD;ipm++) {

	 for(int i=0;i<16;i++) {pre_histor[i]=0;}

   .....

	 float disx = sqrt((xhit-xcenterx)*(xhit-xcenterx)+(yhit-ycenterx)*(yhit-ycenterx));

	 int ndisx = ceilf(disx/((float)(H_MAX-H_MIN)/NUM_BINS));

	 if(ndisx<16) pre_histor[ndisx]++;

}

Compiler could not use registers to replace pre_histor[16] since it does not know what ndisx is.

Hence you will have 16 x 4 byte local memory here.

Of course, if you don’t output value of “pre_histor”, then computation of for-loop is useless,

compiler would remove whole for-loop, then no local memory is used.