I am processing a 2D area of memory with the traditional outer and inner loop, each looping over a different dimension. My original implementation took 6 ms per kernel call. However, after some work to convert it to use halves, which are supposedly faster (and which have sped up other kernels of mine), I find it takes 37.5 ms per kernel call!
I am using a Jetson Nano (compute capability 5.3) and compiling with "-gencode arch=compute_53,code=sm_53 -maxrregcount=32" arguments for nvcc. Adding "--use_fast_math" only resulted in a slight speed-up.
Stuff common to both kernels:
int width = 4;
int height = 360;
#define SQUARE(A) ((A) * (A))
Here is the original kernel (slightly pseudo code for simplicity):
for (int x_offset = 0; x_offset < width; x_offset++)
{
    int xSq = SQUARE(xConst - x_offset);
    for (int y_offset = 0; y_offset < height; y_offset++)
    {
        int ySq = SQUARE(yConst - y_offset);
        int sumSq = xSq + ySq;
        float distanceFromCentre = sqrtf((float)sumSq);
        // 'f' suffixes keep the constants single precision
        // (unsuffixed literals promote the arithmetic to double)
        float correction = (1.0749947E-6f * sumSq) - (0.000297173f * distanceFromCentre) + 1.01820957f;
        float pixelVal = (float)pY[offset] * correction;
    }
}
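For reference, this is roughly how that loop might sit in a complete kernel. This is a sketch under my assumptions: pY is an 8-bit luma plane, pixelCol/pixelRow come from the thread indices, and the offset/output addressing is invented here since the post does not show it.

```cuda
#include <math.h>

#define SQUARE(A) ((A) * (A))

__global__ void correctFloat(const unsigned char *pY, float *pOut,
                             int pitch, int width, int height)
{
    // One thread per source pixel; this indexing is an assumption,
    // not taken from the post.
    int pixelCol = blockIdx.x * blockDim.x + threadIdx.x;
    int pixelRow = blockIdx.y * blockDim.y + threadIdx.y;
    int xConst = (1280 / 2) - pixelCol;
    int yConst = (720 / 2) - pixelRow;
    int offset = pixelRow * pitch + pixelCol; // hypothetical addressing

    for (int x_offset = 0; x_offset < width; x_offset++)
    {
        int xSq = SQUARE(xConst - x_offset);
        for (int y_offset = 0; y_offset < height; y_offset++)
        {
            int ySq = SQUARE(yConst - y_offset);
            int sumSq = xSq + ySq;
            float distanceFromCentre = sqrtf((float)sumSq);
            float correction = (1.0749947E-6f * sumSq)
                             - (0.000297173f * distanceFromCentre)
                             + 1.01820957f;
            pOut[offset] = (float)pY[offset] * correction;
        }
    }
}
```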
And here is the version using halves, with the time each line took in the comments. I tried to use intrinsics as much as possible. The "mult" factor scales the coordinates down to prevent overflow when certain numbers are squared (half's largest finite value is only 65504).
const half mult = __float2half(0.1f);
half a = __float2half(1.0749947E-4f); // original a scaled by 100, since sumSq is scaled by 0.01
half b = __float2half(-0.00297173f);  // original b scaled by 10, since the distance is scaled by 0.1
half c = __float2half(1.01820957f);
half xConst = __short2half_ru((1280 / 2) - pixelCol) * mult;
half yConst = __short2half_ru((720 / 2) - pixelRow) * mult;
int width = 4;
int height = 360;
for (int x_offset = 0; x_offset < width; x_offset++)
{
    half xTemp = __hfma(__short2half_ru(x_offset), mult, -xConst);
    half xSq = SQUARE(xTemp);
    for (int y_offset = 0; y_offset < height; y_offset++)
    {
        half yTemp = __hfma(__short2half_ru(y_offset), mult, -yConst);    // 8.7 ms
        half sumSq = __hfma(yTemp, yTemp, xSq);                           // 3 ms
        half distanceFromCentre = hsqrt(sumSq);                           // 2.2 ms
        half correction = (a * sumSq) + __hfma(b, distanceFromCentre, c); // 11 ms
        half pixelVal = __short2half_ru(pY[offset]) * correction;         // 5.8 ms
    }
}
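As a sanity check on the "mult" scaling (my arithmetic, not from the post): half's largest finite value is 65504, so an unscaled square such as 718 × 718 = 515524 would overflow to +inf, while the scaled square (71.8 × 71.8 ≈ 5155) fits comfortably. The polynomial coefficients then absorb the scaling, since sumSq shrinks by 0.01 and the distance by 0.1:

```cuda
#include <cuda_fp16.h>

// Evaluate the correction from pre-scaled inputs; equivalent to the float
// polynomial because a and b absorb the coordinate scaling:
//   a = 1.0749947E-6f * 100.0f = 1.0749947E-4f  (sumSq is scaled by 0.01)
//   b = -0.000297173f * 10.0f  = -0.00297173f   (distance is scaled by 0.1)
//   c is unchanged.
__device__ half correctionHalf(half sumSq, half dist)
{
    const half a = __float2half(1.0749947E-4f);
    const half b = __float2half(-0.00297173f);
    const half c = __float2half(1.01820957f);
    return __hadd(__hmul(a, sumSq), __hfma(b, dist, c));
}
```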
My code runs fine and the outputs are almost identical. What I'm asking is: why is the second implementation so much slower than the first?
Thanks for looking.