Register overspill - help needed

Ok…

In general, I am confused. Why would I get register spilling because of shared memory?
I understand that if I use more registers than are available they will spill to local memory, but how come this happens with shared memory as well, and how can it be avoided?

I have appended some code here; if you have any ideas or recommendations, they are welcome. The kernel uses 256 threads per block and 368 blocks. The appended device function spills 48 bytes.

// Accumulates the van der Waals and electrostatic energy contributions of the
// atoms in one cell (x_atomCoordsIn / vdwAts) relative to the central atom.
__device__ float checkBin_hst_vdw(volatile float4 *x_atomCoordsIn,
        const float4 centralAtomIn, const int atomsInCell,
        volatile vdwAtom_t vdwAt, volatile vdwAtom_t *vdwAts,
        unsigned int loopStart, float d_cutoff) {

    // __shared__
    volatile float dx;
    volatile float dy;
    volatile float dz;
    volatile float rab2;

    volatile float temp_e = 0.0f;
    volatile float R_AB6;

    volatile float R_AB;
    float R_AB7;
    float erep;
    float rab7;
    volatile float epsilon;
    volatile float rab;
    float eattr;
    volatile float e_upper;
    float sumsqs;

    for (unsigned int position = loopStart; position < atomsInCell; position++) {

        dx = x_atomCoordsIn[position].x - centralAtomIn.x;
        dy = x_atomCoordsIn[position].y - centralAtomIn.y;
        dz = x_atomCoordsIn[position].z - centralAtomIn.z;

        rab2 = (dx * dx + dy * dy + dz * dz);

        if (rab2 < d_cutoff) {

            // vdwAtom_t vdwAts[l] = vdwAts[l];

            sumsqs = (vdwAt.vdwsqrt + vdwAts[position].vdwsqrt);
            rab = sqrtf(rab2);
            R_AB = (vdwAt.R + vdwAts[position].R) * 0.5;

            R_AB6 = R_AB * R_AB * R_AB * R_AB * R_AB * R_AB;

            e_upper = (181.16 * vdwAt.G * vdwAts[position].G * vdwAt.alpha
                    * vdwAts[position].alpha);

            epsilon = 0.5 * e_upper / (sumsqs * R_AB6);

            // hydrogen bond donor
            if (vdwAt.DA == 1) {

                // hydrogen bond acceptor
                if (vdwAts[position].DA == 2) {

                    // R_AB is scaled to 0.8 for D-A interactions. The value used
                    // in the calculation of epsilon is not scaled.
                    R_AB = 0.8 * R_AB;
                    R_AB6 *= 0.262144;

                } else {
                    epsilon += epsilon;
                }

            }
            // hydrogen bond donor
            else if (vdwAts[position].DA == 1) {

                // hydrogen bond acceptor
                if (vdwAt.DA == 2) {

                    // R_AB is scaled to 0.8 for D-A interactions. The value used
                    // in the calculation of epsilon is not scaled.
                    R_AB *= 0.8;
                    R_AB6 *= 0.262144;
                } else {
                    epsilon += epsilon;
                }

            } else {
                float g_AB = (vdwAt.R - vdwAts[position].R) / (2 * R_AB);
                R_AB *= (1.0 + 0.2 * (1.0 - exp(-12.0 * g_AB * g_AB)));
                R_AB6 = R_AB * R_AB * R_AB * R_AB * R_AB * R_AB;
                epsilon = e_upper / (sumsqs * R_AB6);
            }

            R_AB7 = R_AB6 * R_AB;
            rab7 = rab * rab2 * rab2 * rab2;

            erep = (1.07 * R_AB) / (rab + 0.07 * R_AB); //***
            eattr = (((1.12 * R_AB7) / (rab7 + 0.12 * R_AB7)) - 2.0);

            temp_e += epsilon * erep * erep * erep * erep * erep * erep * erep
                    * eattr;

            //------------------electrostatics-----------------
            rab += 0.05;

            temp_e += ((332.0716 * x_atomCoordsIn[position].w
                    * centralAtomIn.w) / rab);
            //-------------------------------------------------
        }
    }
    return temp_e;
}

(1) I don’t see any use of shared memory in this code. Could you attach a self-contained version that is compilable?
(2) Why do many variables in this code use the “volatile” qualifier?
(3) What toolchain (CUDA version) are you building with? What architecture (compute capability) are you building for? What flags are you passing to nvcc?
(4) What’s the exact output produced after adding -Xptxas -v to the nvcc command line?

Let me guess: you are compiling on a 64-bit system for a compute capability 1.x device.
What you are seeing isn't register spilling; it is the memory used for kernel parameters.

(1) Shared memory is passed as a parameter to the function:

volatile float4 *x_atomCoordsIn,

volatile vdwAtom_t *vdwAts,

I attached a compilable version of the code in this reply.

(2) It is meant to instruct the compiler that the variable will be reused. I have read that it can help optimization (?)

(3) What toolchain (CUDA version) are you building with? : 4.0

What architecture (compute capability) are you building for? : 2.0

What flags are you passing to nvcc? : different flags for the debug and release builds:

CONFIG(debug, debug|release) {
  NVCC_FLAGS = --ptxas-options=-v -gencode arch=compute_20,code=sm_20 -g -G -Xcompiler
} else {
  NVCC_FLAGS = -arch=sm_20 -use_fast_math -Xcompiler
}

(4) What’s the exact output produced after adding -Xptxas -v to the nvcc command line? :

nvcc -c --ptxas-options=-v -gencode arch=compute_20,code=sm_20 -g -G -Xcompiler -pipe -I "." -I "Data" -I "GL" -I "Math" -I "openbabel/math" -I "openbabel" -I "Commands" -I "Molecule" -I "Threads" -I "gpu_code" -I "gpu_code/neighbour_lists" -I "gpu_code/MMFF94s" -I "gpu_code/bonded_forces" -I "gpu_code/nonBonded_forces" -I "molsketch_helium" -I "openbabel/data" -I "openbabel/forcefields" -I "/usr/local/cuda/include" gpu_code/neighbour_lists/hs_t.cu -o ./hs_t_cuda.obj

ptxas info : Compiling entry function '_Z30createNeighboursList_t_8s_hst410stepCoords10vdw_params11spaceDecompPf' for 'sm_20'

ptxas info : Function properties for _Z30createNeighboursList_t_8s_hst410stepCoords10vdw_params11spaceDecompPf

88 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

ptxas info : Function properties for _Z11load_bucket10stepCoords10vdw_params11spaceDecompP6float4P7vdwAtomPtPjS7_

64 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

ptxas info : Function properties for _Z16checkBin_hst_vdwPV6float4S_i7vdwAtomPVS2_jf

88 bytes stack frame, 52 bytes spill stores, 52 bytes spill loads

ptxas info : Used 42 registers, 18432+0 bytes smem, 96 bytes cmem[0], 5600 bytes cmem[2]

Could you please give me an email address so that I can send you a compilable part of the code?

Hope that helps… I am really looking forward to some good advice, as I have spent a couple of weeks reading books, manuals and random sites on how to avoid register spilling, but am still not happy…

Cheers,

Than

Sorry, I don’t see an attached file with the source code. It looks like you may have removed it later? You can also send me the code by attaching it to a PM sent through this forum.

It is curious that the compiler uses only 42 registers but spills, instead of using, say, all 63 registers available. Is there any use of -maxrregcount or __launch_bounds__() that restricts register allocation? If not, this may be a case where the compiler heuristic that picks a target register count for optimal performance (a balance of occupancy and spilling) comes up with a suboptimal result. In my experience that happens infrequently, but it can happen.
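For reference, the two mechanisms look like this; the kernel name and the numbers below are made up, purely to illustrate the syntax:

// Per compilation unit, on the nvcc command line:
//   nvcc -maxrregcount=42 ...
// Or per kernel, in the source, telling ptxas that this kernel is never
// launched with more than 256 threads per block, so it can budget registers accordingly:
__global__ void __launch_bounds__(256) myKernel(float *out)
{
    out[threadIdx.x] = 0.0f;  // trivial body, just to keep the example complete
}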

The semantics of “volatile” in C/C++ is that an object so marked can be modified asynchronously by an agent outside the scope of the code. Classical examples are memory-mapped status registers and objects updated by DMA or interrupt service routines. This means uses of “volatile” should occur very sparingly. In most cases it forces the compiler to keep the object in memory permanently, to make sure it picks up modifications caused by that asynchronous agent, so in general the use of “volatile” tends to slow down code by preventing the object from being “cached” in a register.
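A minimal CUDA sketch of such a legitimate use (hypothetical names, unrelated to the posted code): a kernel spin-waits on a flag in mapped host memory that the CPU sets asynchronously while the kernel runs.

__global__ void waitForHostFlag(volatile int *flag, int *out)
{
    // Without "volatile" the compiler would be free to load *flag once and
    // spin forever on a stale register copy.
    while (*flag == 0)
        ;
    *out = 1;
}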

Some people found that use of “volatile” in CUDA can, in certain circumstances, improve performance because, as a side effect of its use, code generation changes, and this can in some cases lead to reduced register pressure and thus higher performance. This exploits an implementation artifact of a particular compiler component, and that component can and will change. I would recommend against such uses of “volatile”, and would suggest you remove all uses of “volatile” in this code.
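Applied to the posted function, that just means dropping the qualifier from the parameters (x_atomCoordsIn, vdwAt, vdwAts) and declaring the locals as plain floats; nothing else changes:

float dx, dy, dz, rab2;
float temp_e = 0.0f;
float R_AB, R_AB6, R_AB7;
float rab, rab7, sumsqs;
float epsilon, e_upper;
float erep, eattr;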

Have you had a chance to try CUDA 4.1 (currently available as release candidate 2)? If you decide to give that a try, please remove all uses of “volatile” first.

Just a quick comment about some code optimization here. I see several instances like:
newvar = var * var * var * var * var * var;

I understand the powf function is slow, but it might be faster to do something like:

newvar = var * var * var;
newvar *= newvar;
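Applied to the posted function, for example, the sixth power could be computed in three multiplies instead of five (a sketch of just that one line):

R_AB6 = R_AB * R_AB * R_AB;   // R_AB^3
R_AB6 *= R_AB6;               // (R_AB^3)^2 = R_AB^6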

Additionally, consider substituting the faster (albeit less accurate) intrinsic floating-point versions of exp and sqrt (__expf and the like) if you see that the results don't differ much.

Note that the release build already uses -use_fast_math, which substitutes the HW-accelerated single-precision math functions. In terms of overall performance, it seems there are barely enough threads to keep a Fermi-class device saturated (ideally one would want 20K+ total threads). The data access pattern also appears to be sub-optimal, as arrays are indexed by “position”, which is not derived from the thread index to follow the recommended base+tid indexing pattern.
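For reference, a minimal sketch of the base+tid pattern (hypothetical kernel and names, not taken from the posted code): consecutive threads touch consecutive elements, so the loads coalesce.

__global__ void processAtoms(const float4 *coords, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // base + tid
    if (tid < n) {
        float4 c = coords[tid];         // coalesced load
        out[tid] = c.x + c.y + c.z;     // placeholder computation
    }
}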

The spilling may be a red herring as far as performance is concerned (for all we know the spilling may occur mostly outside the loop). But the question remains why the compiler picked a lower register count with spilling instead of a higher register count and fewer spills. The compiler’s choice may be optimal, but only building and running the code can tell us.

In terms of optimization it would be helpful to make this code “float clean” by ensuring that all literal floating-point constants are of type float. This will likely lead to a reduction in register pressure by making sure floating-point computation is performed in single precision (single register per operand) as opposed to double precision (register pair per operand). Due to C/C++ type promotion rules, just one double precision operand can cause much of the expression evaluation to be performed in double precision. Examples of such literal constants in the code are:

1.0 → 1.0f
0.2 → 0.2f
0.8 → 0.8f
0.262144 → 0.262144f
1.07 → 1.07f

Add the “f” suffix to make the literal constants float rather than double values.
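Applied to a few lines of the posted function, for example (a sketch; only the constant suffixes and the exp call change):

R_AB = (vdwAt.R + vdwAts[position].R) * 0.5f;
epsilon = 0.5f * e_upper / (sumsqs * R_AB6);
R_AB *= (1.0f + 0.2f * (1.0f - expf(-12.0f * g_AB * g_AB)));  // expf keeps this in single precision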

I haven’t had to use volatile yet, but I think I will for something I am starting on. I couldn’t find anything about it in the Programming Guide 3.2 or the Best Practices Guide. Can you let me know where I can find more info and examples of it, please?

Thanks

There is no mention of it in the official documentation, as it is mostly based on anecdotal evidence from the forums. Try a forum search like volatile cbuchner1 site:forums.nvidia.com for most of it.

To paraphrase an old adage about optimization:

First rule of using “volatile”: Don’t!
Second rule of using “volatile” (experts only): Don’t use it yet!

Seriously, while I understand why programmers might have found the use of “volatile” attractive to work around various issues with suboptimal code generation in the past, I will strongly caution that misusing “volatile” beyond what it was intended for in C/C++, with the aim of exploiting compiler implementation artifacts to increase performance, will come back to bite you as the compiler changes. This is a real-life and not merely a theoretical concern.

I would suggest that anybody struggling with significantly suboptimal code generation file a bug, attaching self-contained repro code that demonstrates the issue, so we can keep improving the compiler over time. Thank you for your help.

* Post removed by myself until I make further experiments to support my claims *