Local Memory use depending on re-ordering of algorithm?

mjmawson · February 26, 2013, 11:18am

Hi all,
I was wondering if anyone could tell me why these two implementations of a lattice Boltzmann solver perform very differently (while producing the same results). Excuse some of the comments, as the code is copied directly from my program:

Implementation 1:

for(l=0;l<19;l++){
	Xnew=x-cx[l];
	Ynew=y-cy[l];
	Znew=z-cz[l];
	//Standard Streaming i.e. no implicit periodicity
	if(Xnew>=0 && Xnew<Nx && Ynew>=0 && Ynew<Ny && Znew>=0 && Znew<Nz){
		stream_coll.f[l]=f[(l*NxPitch*Ny*Nz)+(Znew*Ny*NxPitch)+(Ynew*NxPitch)+Xnew];
	}else{
		stream_coll.f[l]=f[(l*NxPitch*Ny*Nz)+(z*Ny*NxPitch)+(y*NxPitch)+x];
	}
}

//SOME CODE THAT IS IDENTICAL IN BOTH VERSIONS

t1=stream_coll.u*stream_coll.u+stream_coll.v*stream_coll.v+stream_coll.w*stream_coll.w;
for(l=0;l<19;l++){
	t2=stream_coll.u*cx[l]+stream_coll.v*cy[l]+stream_coll.w*cz[l];

	f_red[(l*NxPitch*Ny*Nz)+(z*Ny*NxPitch)+(y*NxPitch)+x]=stream_coll.f[l]*(1.0f-omega)+stream_coll.rho*weight[l]*(1.0f+3.0f*t2+4.5f*(t2*t2)-1.5f*t1)*omega;
}

Implementation 2:

for(l=0;l<19;l++){
	stream_coll.f[l]=f[(l*NxPitch*Ny*Nz)+(z*Ny*NxPitch)+(y*NxPitch)+x];
}

//SOME IDENTICAL CODE

t1=stream_coll.u*stream_coll.u+stream_coll.v*stream_coll.v+stream_coll.w*stream_coll.w;

for(l=0;l<19;l++){
	t2=stream_coll.u*cx[l]+stream_coll.v*cy[l]+stream_coll.w*cz[l];
	Xnew=x+cx[l];
	Ynew=y+cy[l];
	Znew=z+cz[l];

	//Standard Streaming i.e. no implicit periodicity
	if(Xnew>=0 && Xnew<Nx && Ynew>=0 && Ynew<Ny && Znew>=0 && Znew<Nz){
		f_red[(l*NxPitch*Ny*Nz)+(Znew*Ny*NxPitch)+(Ynew*NxPitch)+Xnew]=stream_coll.f[l]*(1.0f-omega)+stream_coll.rho*weight[l]*(1.0f+3.0f*t2+4.5f*(t2*t2)-1.5f*t1)*omega;
	}else{
		f_red[(l*NxPitch*Ny*Nz)+(z*Ny*NxPitch)+(y*NxPitch)+x]=stream_coll.f[l];
	}
}

In the second implementation, some values are stored in local memory rather than registers, and I can’t figure out why. This is all done with CUDA 5, compiling for compute capability 3.0. Thanks.

njuffa · February 26, 2013, 4:55pm

It’s a bit hard to see. I assume you are not talking about local memory usage caused by spilling (you can get information about spills from -Xptxas -v)? It seems your code contains a small local array. Such local arrays are by default allocated in local memory (since they are thread local). If the loop with known trip count gets unrolled, indices into that array become compile time constants, and the array elements can be treated like scalars. Data from the local array can thus be placed into registers. This is an optimization. If my hypothesis is correct, you should see significantly higher register usage for the variant that does not use local memory.

There are various heuristics at play. The compiler unrolls loops subject to code size restrictions among other things, but you can try to force it with a #pragma unroll 19. If the local array is too large, or register pressure is high, the compiler will leave the array in local memory rather than using registers even if the indices are all compile-time constant.

You could inspect the intermediate PTX file (retain it with -keep) to check how the two variants get translated.
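
To make the suggestion concrete, here is a minimal sketch of the pattern rather than the poster's actual kernel. The kernel name sum_directions, the helper run_sum_directions, the macro Q, and the file name in the comment are all made up for illustration. It shows a small thread-local array whose loops are explicitly unrolled so that every index becomes a compile-time constant, which is what allows the compiler to consider keeping the array in registers.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Build and inspect (file name is hypothetical; -arch=sm_30 matches the thread,
// substitute your GPU's architecture, since newer toolkits no longer accept sm_30):
//   nvcc -arch=sm_30 -Xptxas -v -keep unroll_demo.cu
// -Xptxas -v reports register and local memory usage; -keep retains the PTX file.

#define Q 19  // number of lattice directions, as in the loops above

// Hypothetical kernel: gather Q values per node into a thread-local array,
// then reduce them. Layout f[l * n + idx] is an assumption for this sketch.
__global__ void sum_directions(const float* f, float* out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    float fl[Q];  // thread-local array

    // With full unrolling, fl[l] is only ever indexed by constants 0..18,
    // so the elements can be treated like 19 scalar variables.
    #pragma unroll
    for (int l = 0; l < Q; l++) {
        fl[l] = f[l * n + idx];
    }

    float rho = 0.0f;
    #pragma unroll
    for (int l = 0; l < Q; l++) {
        rho += fl[l];
    }
    out[idx] = rho;
}

// Host helper: fills f with 1.0f, runs the kernel on n nodes, and returns
// out[0], or -1.0f if any CUDA call fails.
float run_sum_directions(int n)
{
    std::vector<float> h_f(static_cast<size_t>(Q) * n, 1.0f);
    float* d_f = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_f, h_f.size() * sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_f);
        return -1.0f;
    }
    cudaMemcpy(d_f, h_f.data(), h_f.size() * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    sum_directions<<<blocks, threads>>>(d_f, d_out, n);

    float result = -1.0f;
    cudaError_t launch_err = cudaGetLastError();
    cudaError_t copy_err = cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_f);
    cudaFree(d_out);
    return (launch_err == cudaSuccess && copy_err == cudaSuccess) ? result : -1.0f;
}
```

Calling run_sum_directions(256) from a small main() is enough to exercise the kernel. A kernel this simple may well be unrolled without the pragma, so the interesting comparison is the -Xptxas -v report for your real kernel with and without it.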

mjmawson · February 28, 2013, 12:02pm

You were spot on. For whatever reason, f is unrolled into registers in the first example, but in the second it is not unrolled and is spilled into local memory. Adding the #pragma unroll removed the issue. Many thanks.

njuffa · February 28, 2013, 6:09pm

Glad to hear it all worked out. Just to clarify, when the compiler decides to leave a thread-local array in local memory, this is not spilling. The output from -Xptxas -v should make that clear, as it specifically states how many bytes were involved in spill operations, in addition to showing local memory usage.

Once the compiler has decided which variables to place into registers, it may find out at a later stage that at certain points in the code the number of live variables assigned to registers exceeds the available registers. At such points it temporarily stores some variables out to local memory, then reads them back later. This is spilling. Note that code optimizations made by the compiler itself, such as the creation of induction variables in loops, can increase register pressure.
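
As a rough illustration of the first case (a local array left in local memory, with no spilling involved), consider the sketch below. The kernel pick_direction, the helper run_pick_direction, and the data layout are hypothetical. Because the array is read with an index only known at run time, the compiler generally cannot map its elements to registers, so one would expect -Xptxas -v to show local memory in use while the spill figures stay at zero. That expectation is an assumption about typical compiler behavior, not a guarantee.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define Q 19  // number of lattice directions, as in the thread

// Hypothetical kernel: the final read uses a run-time index, so the
// thread-local array typically has to remain addressable in local memory.
__global__ void pick_direction(const float* f, unsigned sel, float* out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    float fl[Q];  // thread-local array
    for (int l = 0; l < Q; l++) {
        fl[l] = f[l * n + idx];
    }

    // sel is a kernel argument, not a compile-time constant.
    out[idx] = fl[sel % Q];
}

// Host helper: stores the value l in every element of direction l, runs the
// kernel on n nodes, and returns out[0], or -1.0f if any CUDA call fails.
float run_pick_direction(int n, unsigned sel)
{
    std::vector<float> h_f(static_cast<size_t>(Q) * n);
    for (int l = 0; l < Q; l++) {
        for (int i = 0; i < n; i++) {
            h_f[static_cast<size_t>(l) * n + i] = static_cast<float>(l);
        }
    }

    float* d_f = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_f, h_f.size() * sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_f);
        return -1.0f;
    }
    cudaMemcpy(d_f, h_f.data(), h_f.size() * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    pick_direction<<<blocks, threads>>>(d_f, sel, d_out, n);

    float result = -1.0f;
    cudaError_t launch_err = cudaGetLastError();
    cudaError_t copy_err = cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_f);
    cudaFree(d_out);
    return (launch_err == cudaSuccess && copy_err == cudaSuccess) ? result : -1.0f;
}
```

Compiling this with -Xptxas -v and comparing it against a variant that only uses constant indices should make the difference between local memory usage and spill bytes visible in the report.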

mjmawson · February 28, 2013, 10:56pm

Good point, that’s an important difference to note. Thanks for your help.