Problem with reducing registers

Hello,

I am still new to CUDA. I have a kernel that takes most of my computing time on the GPU; the NVIDIA profiler measures 33% occupancy with a register count of 50.

I can see why the register count is so high, but I have no idea how to reduce the number of registers used. Is there anything I can do?

__global__ void cuda_COMP_FG(double *U, double *V, double *F, double *G,
                             int imax, int jmax, int offset,
                             double delx, double dely, double delt,
                             double alpha, double Re, double GY, double GX)
{
	int i = blockIdx.x * BLOCK_SIZE + threadIdx.x + 1;
	int j = blockIdx.y * BLOCK_SIZE + threadIdx.y + 1;

	if (i < imax) {
		double uu_xx = -2.0 * U[i + j*offset];
		uu_xx += U[i-1 + j*offset];
		uu_xx += U[i+1 + j*offset];
		uu_xx /= delx * delx;

		double uu_yy = -2.0 * U[i + j*offset];
		uu_yy += U[i + (j+1)*offset] + U[i + (j-1)*offset];
		uu_yy /= dely * dely;

		double uu_x = ((U[i + j*offset] + U[i+1 + j*offset]) * (U[i + j*offset] + U[i+1 + j*offset]) * 0.25
		             - (U[i-1 + j*offset] + U[i + j*offset]) * (U[i-1 + j*offset] + U[i + j*offset]) * 0.25) / delx
		            + (alpha/delx) * (fabs(U[i + j*offset] + U[i+1 + j*offset]) * (U[i + j*offset] - U[i+1 + j*offset]) * 0.25
		                            - fabs(U[i-1 + j*offset] + U[i + j*offset]) * (U[i-1 + j*offset] - U[i + j*offset]) * 0.25); // OK

		double uv_y = ((V[i + j*offset] + V[i+1 + j*offset]) * (U[i + j*offset] + U[i + (j+1)*offset]) * 0.25
		             - (V[i + (j-1)*offset] + V[i+1 + (j-1)*offset]) * (U[i + (j-1)*offset] + U[i + j*offset]) * 0.25) / dely
		            + (alpha/dely) * (fabs(V[i + j*offset] + V[i+1 + j*offset]) * (U[i + j*offset] - U[i + (j+1)*offset]) * 0.25
		                            - fabs(V[i + (j-1)*offset] + V[i+1 + (j-1)*offset]) * (U[i + (j-1)*offset] - U[i + j*offset]) * 0.25); // OK

		F[i + j*offset] = U[i + j*offset] + delt * ((1.0/Re) * (uu_xx + uu_yy) - uu_x - uv_y + GX);
	} else {
		F[j*offset]        = U[j*offset];
		F[imax + j*offset] = U[imax + j*offset];
	}

	if (j < jmax) {
		double vv_xx = -2.0 * V[i + j*offset];
		vv_xx += V[i+1 + j*offset];
		vv_xx += V[i-1 + j*offset];
		vv_xx /= delx * delx;

		double vv_yy = -2.0 * V[i + j*offset];
		vv_yy += V[i + (j+1)*offset];
		vv_yy += V[i + (j-1)*offset];
		vv_yy /= dely * dely;

		double vv_y = ((V[i + j*offset] + V[i + (j+1)*offset]) * (V[i + j*offset] + V[i + (j+1)*offset]) * 0.25
		             - (V[i + (j-1)*offset] + V[i + j*offset]) * (V[i + (j-1)*offset] + V[i + j*offset]) * 0.25) / dely;
		vv_y += (alpha/dely) * (fabs(V[i + j*offset] + V[i + (j+1)*offset]) * (V[i + j*offset] - V[i + (j+1)*offset]) * 0.25
		                      - fabs(V[i + (j-1)*offset] + V[i + j*offset]) * (V[i + (j-1)*offset] - V[i + j*offset]) * 0.25); // OK

		double uv_x = ((U[i + j*offset] + U[i + (j+1)*offset]) * (V[i + j*offset] + V[i+1 + j*offset]) * 0.25
		             - (U[i-1 + j*offset] + U[i-1 + (j+1)*offset]) * (V[i-1 + j*offset] + V[i + j*offset]) * 0.25) / delx
		            + (alpha/delx) * (fabs(U[i + j*offset] + U[i + (j+1)*offset]) * (V[i + j*offset] - V[i+1 + j*offset]) * 0.25
		                            - fabs(U[i-1 + j*offset] + U[i-1 + (j+1)*offset]) * (V[i-1 + j*offset] - V[i + j*offset]) * 0.25); // OK

		G[i + j*offset] = V[i + j*offset] + delt * ((1.0/Re) * (vv_xx + vv_yy) - uv_x - vv_y + GY);
	} else {
		G[i]               = V[i];
		G[i + jmax*offset] = V[i + jmax*offset];
	}
}

Euhm yeah, how about calculating some of those formulas once and stuffing the result into a variable, and then reusing that variable everywhere the same formula appears?

Perhaps that will save some registers! ;)

(I can’t resist putting in a little joke: such long formulas are unhealthy! ;) =D :w00twave: :rofl:)

Welcome to the CUDA forums, Puffski!
In case you are not yet familiar with our new forum clown: don’t bother following his advice. The compiler of course does its own register allocation, so such changes would not have any effect on the generated code.

Yeah, I have already tried splitting these equations up, but the results were the same.

And the old ones duly reporting here too!

Well, the first thing you could try is of course the -maxrregcount compiler argument. Try -maxrregcount=32 to limit the number of registers used per thread to 32.
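For reference, here is how that flag is passed to nvcc (the file name is just an example; you can also cap registers per kernel with __launch_bounds__ instead of per compilation unit):

```shell
# Cap register usage at 32 per thread for everything in this .cu file,
# and ask ptxas to report the actual register/spill counts.
nvcc -maxrregcount=32 --ptxas-options=-v -o app kernel.cu

# Per-kernel alternative, in the source instead of on the command line:
#   __global__ void __launch_bounds__(BLOCK_SIZE * BLOCK_SIZE)
#   cuda_COMP_FG(...) { ... }
```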

This will probably introduce some register spilling. You can use shared memory to park a variable that is not needed for a long stretch of code, and load it back into a register when you need the value again. This lets you bypass the local-memory path, which is the default destination for spilled registers.

I see that the four aa_bb variables would fit nicely into shared memory; since only two of them are live at any one time, 16 bytes (2 doubles) per thread is enough.
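A minimal sketch of that idea, as a self-contained toy kernel rather than your actual code (kernel name and sizes are made up for illustration): an intermediate value is parked in shared memory so it does not occupy a register across the long stretch of code between its computation and its final use.

```cuda
#define BLOCK_SIZE 16

// Toy illustration: stash an intermediate in shared memory instead of
// holding it in a register for the whole lifetime of the kernel.
__global__ void stash_demo(const double *in, double *out, int n)
{
    __shared__ double stash[BLOCK_SIZE];   // one double per thread
    int i = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    if (i >= n) return;

    double t = in[i] * in[i];              // expensive intermediate
    stash[threadIdx.x] = t;                // park it; the register is freed

    // ... long computation that does not need t, so the compiler can
    // reuse its register here instead of keeping t alive ...
    double other = in[i] + 1.0;

    out[i] = stash[threadIdx.x] + other;   // reload only when needed again
}
```

In your kernel the natural candidates would be uu_xx + uu_yy (and later vv_xx + vv_yy), since each sum is computed early but only consumed in the final F/G update.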

I guess with some reordering of your code things could be optimized further, but I’m having a hard time reading your code.

EDIT: try inlining the fabs functions. It produces longer code, but also gives the compiler greater freedom to optimize.

Using shared memory helped a little, thanks for the advice. But what do you mean by inlining fabs?

You’re using the index variables i and j a lot. Try declaring them as volatile int instead of just int. It forces the compiler to assign them to a register immediately. This works a treat on Compute Capability 1.x - I can’t guarantee that it will also help in your case.
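For illustration, the suggested change would look like this (hypothetical demo kernel; on newer toolchains this old-school trick may be a no-op or even counterproductive, so check the ptxas register report before and after):

```cuda
#define BLOCK_SIZE 16

__global__ void volatile_idx_demo(double *F, const double *U, int offset)
{
    // volatile forces the compiler to materialize the computed index in a
    // register right away, instead of keeping the index expression around
    // and re-deriving parts of it at every use (a CC 1.x-era trick).
    volatile int i = blockIdx.x * BLOCK_SIZE + threadIdx.x + 1;
    volatile int j = blockIdx.y * BLOCK_SIZE + threadIdx.y + 1;

    F[i + j * offset] = U[i + j * offset];
}
```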

Christian