My kernel needs 24 floats for each thread, so I declared them as an array to ease coding. Otherwise, coding could become very exhausting. However, the compiler keeps allocating this array in local memory, causing lots of local memory accesses. I am sure that there are enough registers to hold all these values. Is there a way to force the compiler not to use local memory?
P.S. All offsets used to access this array are known at compile time.
Instead of using a variable to index individual elements, address them explicitly with constant indices, e.g. A[1] = data, A[2] = data2, where A is the array you are talking about. That should help the compiler keep the array in registers. By the way, an array of 24 floats will be tedious to access explicitly like that; I suggest using several smaller arrays.
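To illustrate the constant-index advice above, here is a minimal sketch. The kernel name `sum24`, the host helper `runSum24` and the input layout are made up for the example; they are not from the original poster's code. The only point is that both loops have a compile-time trip count, so after unrolling every access to `a` uses a constant index and the compiler is free (though not obliged) to keep the array in registers.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int N = 24;  // per-thread array size mentioned in the question

// Hypothetical kernel: after full unrolling, every index into `a` is a
// compile-time constant, which is what lets the compiler use registers.
__global__ void sum24(const float* in, float* out, int numThreads)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= numThreads) return;

    float a[N];

    #pragma unroll
    for (int i = 0; i < N; ++i)      // trip count known at compile time
        a[i] = in[tid * N + i] * 2.0f;

    float s = 0.0f;
    #pragma unroll
    for (int i = 0; i < N; ++i)
        s += a[i];

    out[tid] = s;
}

// Hypothetical host helper: runs one thread on 24 ones and returns the sum.
float runSum24()
{
    float hIn[N];
    for (int i = 0; i < N; ++i) hIn[i] = 1.0f;

    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, sizeof(hIn));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dIn, hIn, sizeof(hIn), cudaMemcpyHostToDevice);

    sum24<<<1, 1>>>(dIn, dOut, 1);

    float h = 0.0f;
    cudaMemcpy(&h, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return h;
}

int main()
{
    printf("sum = %f\n", runSum24());
    return 0;
}
```

Compiling with `nvcc -Xptxas -v` makes ptxas print the register and local memory usage of each kernel, which is the easiest way to check whether the array really stayed out of local memory on a given compiler version.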
To Kiran_CUDA: It simply means there is no indirect global memory access.
To Jimmy: Actually, that’s how I wrote it. The problem is that there are a texture load and __syncthreads() in the loop body, so the compiler wouldn’t unroll it. Anyway, I unrolled it manually. I haven’t tested this, but I suspect that unrolling or not doesn’t affect the decision of where to put the arrays. Nitin.life’s suggestion is good. And a high maxrregcount seems to have no effect either. Thanks!
P.S. Maybe nvcc should try to unroll the loop when there is a texture load or __syncthreads() inside the body, which would reduce programmers’ work a lot!
I’ve been able to unroll loops with __syncthreads() before without a problem. Maybe your problem lies with the texture loads? Tigga once told me he had trouble unrolling when these calls were involved…
I apologize for posting a question in a thread that is already open, but since the topic is similar I thought it would be more useful to continue here than to open a new thread.
I would like to keep a 16-element array in registers, without any use of local memory. In my case, too, I don’t saturate the register space.
My code looks like this:
[codebox]
int pre_histor[16];
…
for (int ipm = 0; ipm < PM_X_THREAD; ipm++) {
    for (int i = 0; i < 16; i++) { pre_histor[i] = 0; }
    …
    float disx = sqrt((xhit-xcenterx)*(xhit-xcenterx)+(yhit-ycenterx)*(yhit-ycenterx));
    int ndisx = ceilf(disx/((float)(H_MAX-H_MIN)/NUM_BINS));
    if (ndisx < 16) pre_histor[ndisx]++;
}
[/codebox]
With just this code I have no problem: everything stays in registers. But if I add something like
maxhx = pre_histor[4];
or something more complicated like
#pragma unroll
for (int i = 0; i < 16; i++) {
    if (pre_histor[i] > maxhx) {
        maxhx = pre_histor[i];
        maxnx = i;
    }
}
then the array is placed in local memory. Could you explain why, and how I can avoid this? I tried splitting the 16-element array into two 8-element arrays and got exactly the same behaviour…
The problem is that the compiler is not very good at figuring out the indexing of arrays. If you can replace your array with individual registers, or if you can manually unroll that “for” loop, then the array will go into registers; otherwise …
try
loop starts
if (pre_histor[i]>maxhx) {
aa = pre_histor[i];
}
else {aa = 0.0f;}
if(aa>0.0f) maxnx = i;
loop ends
I wouldn’t say the compiler isn’t very good; it’s just that it doesn’t know “maxhx” at compile time. If you want to unroll something, all variables must be known to the compiler…
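For what it’s worth, registers cannot be indexed at run time, so the usual reason such an array lands in local memory is an index the compiler cannot resolve to a constant. In the code posted above that would be `pre_histor[ndisx]++`, since `ndisx` is computed per hit. Below is a minimal sketch of one possible workaround: replace the dynamic index with a fully unrolled compare loop, so that every array access uses a constant index. The kernel `histArgmax`, the helper `runHistArgmax`, the constant `HIST_SIZE`, the precomputed `bins` input, and the starting values of the maximum search are all assumptions made for the example, not the original poster’s code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int HIST_SIZE = 16;  // size of the per-thread histogram (assumed)

// Hypothetical kernel: each thread histograms `perThread` bin indices and
// reports its fullest bin. No access to `hist` uses a runtime index.
__global__ void histArgmax(const int* bins, int perThread,
                           int* maxCount, int* maxBin)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int hist[HIST_SIZE];

    #pragma unroll
    for (int b = 0; b < HIST_SIZE; ++b) hist[b] = 0;

    for (int k = 0; k < perThread; ++k) {
        int n = bins[tid * perThread + k];  // runtime value, like ndisx
        // Instead of hist[n]++ (dynamic index), compare against every bin.
        // Values outside [0, HIST_SIZE) match no bin and are ignored.
        #pragma unroll
        for (int b = 0; b < HIST_SIZE; ++b)
            hist[b] += (n == b) ? 1 : 0;
    }

    int best = 0;      // assumed starting maximum
    int bestBin = -1;  // -1 means every bin was empty
    #pragma unroll
    for (int b = 0; b < HIST_SIZE; ++b) {
        if (hist[b] > best) { best = hist[b]; bestBin = b; }
    }

    maxCount[tid] = best;
    maxBin[tid] = bestBin;
}

// Hypothetical host helper: one thread, six samples; returns the fullest bin
// and writes its count to *countOut.
int runHistArgmax(int* countOut)
{
    const int hBins[] = {3, 5, 3, 20, 3, 5};
    const int n = sizeof(hBins) / sizeof(hBins[0]);

    int *dBins = nullptr, *dCount = nullptr, *dBin = nullptr;
    cudaMalloc(&dBins, sizeof(hBins));
    cudaMalloc(&dCount, sizeof(int));
    cudaMalloc(&dBin, sizeof(int));
    cudaMemcpy(dBins, hBins, sizeof(hBins), cudaMemcpyHostToDevice);

    histArgmax<<<1, 1>>>(dBins, n, dCount, dBin);

    int hBin = -1;
    cudaMemcpy(countOut, dCount, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&hBin, dBin, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dBins);
    cudaFree(dCount);
    cudaFree(dBin);
    return hBin;
}

int main()
{
    int count = 0;
    int bin = runHistArgmax(&count);
    printf("fullest bin = %d, count = %d\n", bin, count);
    return 0;
}
```

The trade-off is 16 compare-and-add operations per sample instead of one read-modify-write in local memory. Whether that is actually faster depends on the hardware and the rest of the kernel, so it is worth measuring both versions.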