How do I force nvcc to keep values in registers instead of shared memory? Need help understanding compiler options

Hi, all,

I ran into an interesting problem. I wrote an artificial example to measure shared-memory bank conflicts and found some interesting nvcc optimization behavior.

My kernel is organized so that I load 32 floats from shared memory into registers and then use them for the uncoalesced shared-memory calculations.

However, the compiler recognizes that the data for these registers stays in shared memory the whole time and skips the load into registers (reducing the total register count). This significantly drops performance, because the algorithm then runs into shared-memory bank conflicts.

For my example I found a workaround: putting a synchronization in between (the commented-out __syncthreads() below), but I cannot guarantee that this is possible in every other example. The kernel is below:

#define NB 32

/* KERNEL, MYTHID, NTHREADS and MYBLKID are macros defined elsewhere in the
   project (kernel qualifier, thread index, threads per block, block index). */
void KERNEL testkernel(const int M, float *AA)
{
   int   i, inc, jx, jy;
   float *pS;
   float T00, T01, T02, T03, T04, T05, T06, T07, T08, T09;
   float T10, T11, T12, T13, T14, T15, T16, T17, T18, T19;
   float T20, T21, T22, T23, T24, T25, T26, T27, T28, T29;
   float T30, T31;
   __shared__ float SA[1024], SB[1024];

   jx  = MYTHID / NB;
   inc = NTHREADS / NB;

   /* Load one NB x NB tile from global into shared memory. */
   for (i = MYTHID; i < NB*NB; i += NTHREADS)
      SA[i] = AA[NB*NB*MYBLKID + i];
   __syncthreads();

   for (i = 0; i < M; i++)
   {
      /* Load row jx of SA into registers. */
      pS = SA + jx*NB;
      T00 = pS[ 0]; T01 = pS[ 1]; T02 = pS[ 2]; T03 = pS[ 3]; T04 = pS[ 4];
      T05 = pS[ 5]; T06 = pS[ 6]; T07 = pS[ 7]; T08 = pS[ 8]; T09 = pS[ 9];
      T10 = pS[10]; T11 = pS[11]; T12 = pS[12]; T13 = pS[13]; T14 = pS[14];
      T15 = pS[15]; T16 = pS[16]; T17 = pS[17]; T18 = pS[18]; T19 = pS[19];
      T20 = pS[20]; T21 = pS[21]; T22 = pS[22]; T23 = pS[23]; T24 = pS[24];
      T25 = pS[25]; T26 = pS[26]; T27 = pS[27]; T28 = pS[28]; T29 = pS[29];
      T30 = pS[30]; T31 = pS[31];
//    __syncthreads();

      /* SB = SA * SA for this tile: register row jx times column jy of SA. */
      for (jy = MYTHID - jx*NB; jy < NB; jy += inc)
      {
         pS = SA + jy;
         SB[jx*NB+jy] = T00*pS[ 0   ]+T01*pS[   NB]+T02*pS[ 2*NB]+T03*pS[ 3*NB]+T04*pS[ 4*NB]+
                        T05*pS[ 5*NB]+T06*pS[ 6*NB]+T07*pS[ 7*NB]+T08*pS[ 8*NB]+T09*pS[ 9*NB]+
                        T10*pS[10*NB]+T11*pS[11*NB]+T12*pS[12*NB]+T13*pS[13*NB]+T14*pS[14*NB]+
                        T15*pS[15*NB]+T16*pS[16*NB]+T17*pS[17*NB]+T18*pS[18*NB]+T19*pS[19*NB]+
                        T20*pS[20*NB]+T21*pS[21*NB]+T22*pS[22*NB]+T23*pS[23*NB]+T24*pS[24*NB]+
                        T25*pS[25*NB]+T26*pS[26*NB]+T27*pS[27*NB]+T28*pS[28*NB]+T29*pS[29*NB]+
                        T30*pS[30*NB]+T31*pS[31*NB];
      }
      __syncthreads();

      /* Load row jx of SB into registers. */
      pS = SB + jx*NB;
      T00 = pS[ 0]; T01 = pS[ 1]; T02 = pS[ 2]; T03 = pS[ 3]; T04 = pS[ 4];
      T05 = pS[ 5]; T06 = pS[ 6]; T07 = pS[ 7]; T08 = pS[ 8]; T09 = pS[ 9];
      T10 = pS[10]; T11 = pS[11]; T12 = pS[12]; T13 = pS[13]; T14 = pS[14];
      T15 = pS[15]; T16 = pS[16]; T17 = pS[17]; T18 = pS[18]; T19 = pS[19];
      T20 = pS[20]; T21 = pS[21]; T22 = pS[22]; T23 = pS[23]; T24 = pS[24];
      T25 = pS[25]; T26 = pS[26]; T27 = pS[27]; T28 = pS[28]; T29 = pS[29];
      T30 = pS[30]; T31 = pS[31];
//    __syncthreads();

      /* SA = SB * SB for this tile. */
      for (jy = MYTHID - jx*NB; jy < NB; jy += inc)
      {
         pS = SB + jy;
         SA[jx*NB+jy] = T00*pS[ 0   ]+T01*pS[   NB]+T02*pS[ 2*NB]+T03*pS[ 3*NB]+T04*pS[ 4*NB]+
                        T05*pS[ 5*NB]+T06*pS[ 6*NB]+T07*pS[ 7*NB]+T08*pS[ 8*NB]+T09*pS[ 9*NB]+
                        T10*pS[10*NB]+T11*pS[11*NB]+T12*pS[12*NB]+T13*pS[13*NB]+T14*pS[14*NB]+
                        T15*pS[15*NB]+T16*pS[16*NB]+T17*pS[17*NB]+T18*pS[18*NB]+T19*pS[19*NB]+
                        T20*pS[20*NB]+T21*pS[21*NB]+T22*pS[22*NB]+T23*pS[23*NB]+T24*pS[24*NB]+
                        T25*pS[25*NB]+T26*pS[26*NB]+T27*pS[27*NB]+T28*pS[28*NB]+T29*pS[29*NB]+
                        T30*pS[30*NB]+T31*pS[31*NB];
      }
      __syncthreads();
   }

   /* Write the tile back to global memory. */
   for (i = MYTHID; i < NB*NB; i += NTHREADS)
      AA[NB*NB*MYBLKID+i] = SA[i];
   return;
}

If I compile it for the GTX 260, the cubin shows only 8 registers and the kernel runs with 256 threads at about 150 GFlop/s. If I uncomment the synchronization, the register usage goes up to 40 and it reaches about 210 GFlop/s.
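(Side note, not from the original post: instead of inspecting the cubin by hand, the per-kernel register usage can be printed at compile time with ptxas verbose output. The file name below is a placeholder, and sm_13 is assumed because the GTX 260 is a compute capability 1.3 device:)

nvcc -arch=sm_13 -Xptxas -v -c testkernel.cu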

Is there a way to force the compiler to always keep some variables in registers and not optimize them out, with the resulting performance drop?

Thank you

Serge

Hi,

Have you tried making your variables volatile?
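(A minimal sketch of what that change looks like for the kernel above, assuming the rest of the code stays the same: only the declaration of the temporaries changes, and the volatile qualifier stops nvcc from folding the shared-memory reads back into the inner loop, so the values are actually materialized in registers.)

   /* Same temporaries as in the kernel above, now declared volatile so the
      compiler cannot eliminate the loads into them. */
   volatile float T00, T01, T02, T03, T04, T05, T06, T07, T08, T09;
   volatile float T10, T11, T12, T13, T14, T15, T16, T17, T18, T19;
   volatile float T20, T21, T22, T23, T24, T25, T26, T27, T28, T29;
   volatile float T30, T31;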

Hi,

Thank you, T.B. It solves the problem!

Serge