I was working on an optical propagation simulation when I ran into the following scenario. This basically sums up a bunch of propagated energies, and I have two major loops, let’s call them the inner and outer (each has an X and Y). The outer ‘loop’ is actually the threads launched, and computes a value (in the code, it’s oscale) that it needs to multiply every inner value by.
In the first cut of this program, I did indeed place the multiply there, but then I realized I could simply multiply it at the end of the sum for the inner loop and save myself a multiply in every thread. However, when I moved this multiply outside of the inner loop right before the summed value got stored back to global memory, the Profiler started reporting an additional 2048 uncoalesced Global loads (one for each thread, see the commented out line at the end). If I perform the store to global without multiplying by this scale factor, those uncoalesced loads go away. Don’t understand this since oscale is a register variable…Is there some sort of access collision problem I’m unaware of?
__global__ void integrateDet( float *intobj, const size_t iop, const float4 trans,
const float4 *pobj, const size_t obj_pitch,
const unsigned int owidth, const unsigned int oheight,
const unsigned int dwidth, const unsigned int dheight )
{
/* This function calculates the radius from the transmitter pixels sent in to the
object pixel for this thread, then uses this scaling across all detector pixels.
Texture fetches are used for detector positions since they are constant and used
by every thread in an orderly fashion. The summed value is then written back out
to the array holding the integrated signal per object pixel, where a reduce
function will handle the summation
*/
// Declare variables
const unsigned int x = IMAD(blockIdx.x,blockDim.x,threadIdx.x);
const unsigned int y = IMAD(blockIdx.y,blockDim.y,threadIdx.y);
if ( x<owidth && y<oheight ) {
float oscale,dscale,sum,rowsum;
float4 obj;
float3 delta;
// Get location of this object pixel
float4 *ptr=(float4*)((char*)pobj + IMUL(y,obj_pitch));
obj=ptr[x];
// Calculate distance from transmitter to object
delta.x = obj.x - trans.x;
delta.y = obj.y - trans.y;
delta.z = obj.z - trans.z;
// Get reciprocal distance squared
oscale = 1.f / (delta.x*delta.x + delta.y*delta.y + delta.z*delta.z);
// Square above and mutliply by object and transmitter value to get scaling factor for all reciever pixels
oscale = trans.w * obj.w * oscale * oscale;
// Loop over detector pixels
sum=0.f;
for (unsigned int j=0; j<dheight; j++ ) {
rowsum=0.f;
for (unsigned int i=0; i<dwidth; i++) {
// Load detector position
float4 det=tex2D(det_Tex, (float)i+0.5f, (float)j+0.5f);
delta.x = det.x-obj.x;
delta.y = det.y-obj.y;
delta.z = det.z-obj.z;
dscale = 1.0f / (delta.x*delta.x + delta.y*delta.y + delta.z*delta.z);
rowsum += dscale * dscale;
} // End for i
sum += rowsum;
} // End for j
float *fptr=(float*)((char*)intobj+IMUL(y, iop));
// fptr[x]=sum*oscale;
fptr[x]=sum;
} // End if in object
} // End integrateDet()