I’ve noticed that the 32-bit toolkit generates slightly tighter code than its bigger 64-bit brother. Mainly, the shared memory footprint is smaller. Sometimes, the register usage is lower by one or two registers, in which case the increased occupancy brings noticeable performance gains (anywhere between 10 and 15%).
On one particular kernel, the two compilers produce the following outputs (taken from the .cubin file):
32-bit:
lmem = 12
smem = 2096
reg = 20
bar = 1
64-bit:
lmem = 12
smem = 2112
reg = 21
bar = 1
That’s 16 extra bytes of shared memory, and one extra register for doing the exact same thing…
I’ve posted below the kernel that produces these results.
After an intense struggle comparing the .ptx files from the two versions, I found that (correct me if I’m wrong) in the 64-bit code, pointers to device memory are treated as 64 bits wide, which would explain the larger shared memory usage, and maybe the extra register as well. Of course, that is complete and utter bulls*it, considering that the G80, G84, G86, G92, G200, and all the others that I’ve missed use a 32-bit memory space.
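For what it’s worth, the mechanics do check out: on compute 1.x parts, kernel arguments are passed through shared memory, so the three pointer parameters alone grow from 12 to 24 bytes under -m64, which, give or take alignment, lines up with the 16 extra bytes. If anyone wants to see the effect in isolation, a throwaway kernel with a similar parameter mix should do it when compiled once with -m32 and once with -m64 (the file name, kernel name, and -arch below are placeholders, not from my project):
[codebox]
// toy_args.cu -- throwaway sketch, NOT the kernel from this post; it only mimics
// the parameter mix (three pointers plus a few scalars) to isolate the argument size.
// Compare the resource usage of the two builds, e.g.:
//   nvcc -m32 -arch=sm_11 -cubin toy_args.cu
//   nvcc -m64 -arch=sm_11 -cubin toy_args.cu
// (adding --ptxas-options=-v prints the same smem/reg numbers at compile time)
__global__ void toyArgs(float2 *xy, float *z, float *charges,
                        unsigned int n, unsigned int p, float resolution)
{
    unsigned int ti = blockDim.x * blockIdx.x + threadIdx.x;
    if (ti < n)
        z[ti] = xy[ti].x * resolution + charges[ti % p];
}
[/codebox]
If the pointer-size theory is right, the only difference between the two builds of a toy like this should be a handful of extra smem bytes for the wider pointer arguments.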
Here’s the kernel (the template parameter, LineSteps, is <16>):
[codebox]
// This union allows a kernel to use the same shared memory for three different arrays
// This is possible because the kernel will only use one array at a time
union kernelData
{
// The indexing of this array is deliberately reversed to [y][x] to reduce the indexing
// overhead when it is read from [y][0] to [y][BLOCK_X_MT - 1]
pointCharge<float> charge[BLOCK_Y_MT][BLOCK_X_MT];
// A shared array of accumulators is also needed for the final summation step
Vector3<float> smTemp[BLOCK_X_MT][BLOCK_Y_MT];
// A shared array of points allows the starting point to be read and computed once per column (ty==0)
Vector3<float> smPoint[BLOCK_X_MT];
};
template<unsigned int LineSteps>
__global__ void CalcField_MTkernel(float2* xyInterleaved, float* z, pointCharge<float> *Charges,
unsigned int n, unsigned int p, unsigned int fieldIndex, float resolution)
{
unsigned int tx = threadIdx.x;
unsigned int ty = threadIdx.y;
unsigned int ti = blockDim.x * blockIdx.x + tx;
// Using a union of all the needed data types allows massive smem economy
__shared__ kernelData kData;
// Previous point, used to calculate the current point, and the cumulative field vector
Vector3<float> point, temp;
float2 ptXY;
if(!ty)
{
// Load starting point
// The field vectors are arranged as a structure of arrays in order to enable coalesced reads
// The x and y coordinates are interleaved in one array, producing coalesced 64-byte reads,
// and the z coordinates are placed in a separate array, producing coalesced 32-byte reads
// (see the host-side layout sketch after the codebox)
ptXY = xyInterleaved[n * (fieldIndex - 1) + ti];
// Once the xy coordinates are read, place them in the appropriate variable
point.x = ptXY.x;
point.y = ptXY.y;
// Now read the z coordinate
point.z = z[n * (fieldIndex - 1) + ti];
// Place the point in shared memory for other threads to access
kData.smPoint[tx] = point;
}
for(unsigned int bigStep = 0; bigStep < LineSteps; bigStep ++)
{
// Number of iterations of main loop
// Recalculating the number of steps here allows a while loop to be used rather than a for loop
// This reduces the register usage by one register, allowing a higher warp occupancy
unsigned int steps = (p + BLOCK_DIM_MT - 1) / BLOCK_DIM_MT;
// Reset the cumulative field vector
temp.x = temp.y = temp.z = 0;
// All starting points need to be loaded to smem; otherwise, threads may read back the wrong point
__syncthreads();
// load the starting point
point = kData.smPoint[tx];
// Equivalent to for(unsigned int i = 0; i < steps; i++), with steps reused as the loop counter
do{
// It is important to decrement steps separately, outside the while condition, for the register gain to happen
steps--;
// Load point charges from global memory
// The unused charges must be padded until the next multiple of BLOCK_X
kData.charge[ty][tx] = Charges[steps * BLOCK_DIM_MT + ty * BLOCK_X_MT + tx];
// Wait for all loads to complete
__syncthreads();
// Unrolling the following loop completely saves one register compared to a partial unroll
// While performance-wise there is no benefit in a complete unroll, the saved register will enable
// a higher warp occupancy
#pragma unroll
for(unsigned int i = 0; i < BLOCK_X_MT; i++)
{
temp += electroPartField(kData.charge[ty][i], point); // ElectroPartFieldFLOP + 3 FLOPs
}
__syncthreads();
}while(steps);
// Now that each thread's partial field vector is computed, it can be written to shared memory
kData.smTemp[tx][ty] = temp;
// Before summing up all the partials, the shared memory writes must complete
__syncthreads();
// The next section is for summing the vectors and writing the result
// This is to be done by threads with a y index of 0
if(!ty)
{
// The first partial sum is already in registers, so it is only necessary to add the remaining components
#pragma unroll
for(unsigned int i = 1; i < BLOCK_Y_MT; i++)
{
temp += kData.smTemp[tx][i];
}
// Finally, add the unit vector of the field divided by the resolution to the previous point to get the next point
point += vec3SetInvLen(temp, resolution);// 13 FLOPs (10 set len + 3 add)
// The results must be written back as interleaved xy and separate z coordinates
ptXY.x = point.x;
ptXY.y = point.y;
xyInterleaved[n * fieldIndex + ti] = ptXY;
z[n * fieldIndex + ti] = point.z;
// Store the new point so the whole column starts the next line segment from it
kData.smPoint[tx] = point;
fieldIndex ++;
}
}
}//*/
[/codebox]
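Since the coalescing comments near the top only make full sense together with the host-side layout, here is a stripped-down sketch of how the xy/z buffers are organized. The sizes and names are placeholders (the real code defines pointCharge, Vector3 and the block dimensions elsewhere); this is only meant to show the indexing, not my actual allocation code:
[codebox]
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder sizes, not the ones from my simulation
static const unsigned int N_LINES = 4096;   // field lines, one per thread (the kernel's n)
static const unsigned int N_STEPS = 256;    // points stored per line

int main()
{
    float2 *d_xy = 0;   // x and y interleaved; element [step * N_LINES + line]
    float  *d_z  = 0;   // z kept separate so it stays a 4-byte coalesced read

    cudaMalloc((void**)&d_xy, (size_t)N_LINES * N_STEPS * sizeof(float2));
    cudaMalloc((void**)&d_z,  (size_t)N_LINES * N_STEPS * sizeof(float));

    // Slot 0 of every line holds the starting point; each launch of the kernel then
    // reads slot (fieldIndex - 1) and writes slot fieldIndex, so consecutive threads
    // (consecutive 'line' indices) always touch consecutive elements of both arrays.
    // ... fill the host-side buffers and cudaMemcpy them here ...

    printf("xy buffer: %u MB, z buffer: %u MB\n",
           (unsigned)(((size_t)N_LINES * N_STEPS * sizeof(float2)) >> 20),
           (unsigned)(((size_t)N_LINES * N_STEPS * sizeof(float))  >> 20));

    cudaFree(d_xy);
    cudaFree(d_z);
    return 0;
}
[/codebox]
With this layout, consecutive values of ti read consecutive float2 and float elements, which is where the coalesced 64-byte and 32-byte transactions mentioned in the comments come from.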