Unknown reason for Global load Line in question uses local register variable

I was working on an optical propagation simulation when I ran into the following scenario. This basically sums up a bunch of propagated energies, and I have two major loops, let’s call them the inner and outer (each has an X and Y). The outer ‘loop’ is actually the threads launched, and computes a value (in the code, it’s oscale) that it needs to multiply every inner value by.

In the first cut of this program, I did indeed place the multiply there, but then I realized I could simply multiply it at the end of the sum for the inner loop and save myself a multiply in every thread. However, when I moved this multiply outside of the inner loop right before the summed value got stored back to global memory, the Profiler started reporting an additional 2048 uncoalesced Global loads (one for each thread, see the commented out line at the end). If I perform the store to global without multiplying by this scale factor, those uncoalesced loads go away. Don’t understand this since oscale is a register variable…Is there some sort of access collision problem I’m unaware of?

__global__ void integrateDet(	float *intobj, const size_t iop, const float4 trans, 

								const float4 *pobj, const size_t obj_pitch,

								const unsigned int owidth, const unsigned int oheight,

								const unsigned int dwidth, const unsigned int dheight )


	/* This function calculates the radius from the transmitter pixels sent in to the 

		object pixel for this thread, then uses this scaling across all detector pixels.

		Texture fetches are used for detector positions since they are constant and used

		by every thread in an orderly fashion. The summed value is then written back out

		to the array holding the integrated signal per object pixel, where a reduce

		function will handle the summation


	// Declare variables

	const unsigned int x = IMAD(blockIdx.x,blockDim.x,threadIdx.x);

	const unsigned int y = IMAD(blockIdx.y,blockDim.y,threadIdx.y);

	if ( x<owidth && y<oheight ) {

		float oscale,dscale,sum,rowsum;

		float4 obj;

		float3 delta;

		// Get location of this object pixel

		float4 *ptr=(float4*)((char*)pobj + IMUL(y,obj_pitch));


		// Calculate distance from transmitter to object

		delta.x = obj.x - trans.x;

		delta.y = obj.y - trans.y;

		delta.z = obj.z - trans.z;

		// Get reciprocal distance squared

		oscale = 1.f / (delta.x*delta.x + delta.y*delta.y + delta.z*delta.z);

		// Square above and mutliply by object and transmitter value to get scaling factor for all reciever pixels

		oscale = trans.w * obj.w * oscale * oscale;

		// Loop over detector pixels


		for (unsigned int j=0; j<dheight; j++ ) {


			for (unsigned int i=0; i<dwidth; i++) {

				// Load detector position

				float4 det=tex2D(det_Tex, (float)i+0.5f, (float)j+0.5f);

				delta.x = det.x-obj.x;

				delta.y = det.y-obj.y;

				delta.z = det.z-obj.z;

				dscale = 1.0f / (delta.x*delta.x + delta.y*delta.y + delta.z*delta.z);

				rowsum += dscale * dscale;

			} // End for i

			sum += rowsum;

		} // End for j

		float *fptr=(float*)((char*)intobj+IMUL(y, iop));

//		fptr[x]=sum*oscale;


	} // End if in object

} // End integrateDet()

which version is your CUDA toolkit?

When you don’t multiply by oscale, the entire code computing oscale is likely going to be removed, so whatever uncoalesced loads you have in there, they are in the computation of oscale.

Toolkit/SDK version is 3.2. I understand what you’re saying, except that when I place that ‘* oscale’ inside the inner loop such that the line

rowsum += dscale * dscale;


rowsum += dscale * dscale * oscale;

…and I still don’t have the uncoalesced load issue.

So it doesn’t appear that the oscale calculation is leading to the uncoalesced load, but I may be wrong.

I attempted to paste a screenshot of three Profilier sessions here, but for some reason, the forum kept saying ‘You must enter a post.’ I’ve attached a PNG of this screenshot as an alternative. The three sessions from the Profiler are (top to bottom): (1) oscale multiplied by sum right at global write, (2) oscale not used anywhere in kernel, (3) oscale used in inner loop sum per element. The first row in each session is the kernel code included in the original post.