Converting a kernel from floats and ints to halfs is 6x slower

I am processing a 2D area of memory with the traditional outer and inner loop, each looping over a different dimension. My original implementation took 6ms per kernel call; however, after some work to convert it to use halfs, which are supposedly faster (and I have had success using them in other kernels), I find it takes 37.5ms per kernel call!

I am using a Jetson Nano (compute capability 5.3) and compiling with the “-gencode arch=compute_53,code=sm_53 -maxrregcount=32” arguments to nvcc. Enabling “--use_fast_math” only resulted in a slight speed-up.

Stuff common to both kernels:


int width = 4;
int height = 360;
#define SQUARE(A)       ((A) * (A))

Here is the original kernel (slightly pseudo code for simplicity):

for(x_offset = 0; x_offset < width; x_offset++)
{
    int xSq = SQUARE(xConst - x_offset);
    
    for(y_offset = 0; y_offset < height; y_offset++)
    {
        int ySq = SQUARE(yConst - y_offset);
        int sumSq = xSq + ySq;
        float distanceFromCentre = sqrtf(sumSq);
        float correction = (1.0749947E-6 * sumSq) - (0.000297173 * distanceFromCentre) + 1.01820957;
        float pixelVal = (float)pY[offset] * correction;
    }   
}

And here is the version using halfs, with the times each line took in the comments. I tried to use intrinsics as much as possible. The use of “mult” is to prevent overflow when certain numbers are squared.

const half mult = 0.1;
half a = __float2half(1.0749947E-4);
half b = __float2half(-0.00297173);
half c = __float2half(1.01820957);
half xConst = __short2half_ru((1280 / 2) - pixelCol) * mult;
half yConst = __short2half_ru((int)(720 / 2) - pixelRow) * mult;

int width = 4;
int height = 360;

for(x_offset = 0; x_offset < width; x_offset++)
{
    half xTemp = __hfma(__short2half_ru(x_offset), mult, -xConst);
    half xSq = SQUARE(xTemp);
    
    for(y_offset = 0; y_offset < height; y_offset++)
    {
        half yTemp = __hfma(__short2half_ru(y_offset), mult, -yConst);      // 8.7ms
        half sumSq = __hfma(yTemp, yTemp, xSq);                             // 3ms
        half distanceFromCentre = hsqrt(sumSq);                             // 2.2ms
        half correction = (a * sumSq) + __hfma(b, distanceFromCentre, c);   // 11ms    
        half pixelVal = __short2half_ru(pY[offset]) * correction;           // 5.8ms 
    }   
}

My code runs fine and the outputs are almost identical. What I’m asking is: why is the second implementation so much slower than the first?

Thanks for looking.

You’ve provided code that I can’t even compile, much less run. So my comments will be fairly limited.

  1. Where possible, the full half throughput (whatever it may be) is often only achieved when processing half2 quantities (see the sketch after this list).
  2. Some of the benefit of half (at least compared to 4-byte quantities like float or int) is in the loading, storing, and movement of data. If you are loading int or float data into the kernel, then converting to half, you are probably missing some of the benefit.
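For illustration, here is a minimal, untested sketch of what point 1 could look like applied to your inner-loop math. The function name and the pre-replicated coefficients are my own invention, not your code:

#include <cuda_fp16.h>

// Untested sketch: evaluate the correction for two y-offsets per call.
// a2/b2/c2 are the scalar coefficients replicated into both lanes,
// e.g. half2 a2 = __half2half2(a);
__device__ half2 correction2(half2 yTemp2, half2 xSq2,
                             half2 a2, half2 b2, half2 c2)
{
    half2 sumSq2 = __hfma2(yTemp2, yTemp2, xSq2);  // two sums of squares at once
    half2 dist2  = h2sqrt(sumSq2);                 // vector square root
    return __hadd2(__hmul2(a2, sumSq2), __hfma2(b2, dist2, c2));
}

Each half2 intrinsic performs two operations per instruction, which is where the extra throughput would come from.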

I’m not suggesting either of these things is possible in your case, and I’m also not suggesting they fully explain a 6x perf difference. I generally try to avoid making judgements or doing much work when there is not a short, complete test case to work with. Perhaps others will be able to comment or help.


If that is your actual code, the use of double-precision literal constants is going to cause the computation to be performed in double precision, which will drag down performance. You would want to add an f suffix to the constants so they have type float:

float correction = (1.0749947E-6f * sumSq) - (0.000297173f * distanceFromCentre) + 1.01820957f;

The half-precision version of the code would seem to be negatively impacted by the __short2half_ru() calls. Conversions take time, and they may also interfere with some compiler optimizations. Can you change pY[] to half precision?

The Jetson Nano has compute capability 5.3 and therefore should have full FP16 throughput per section 5.4.1 of the CUDA Programming Guide. From the limited amount of code shown, I do not see an obvious reason for a 6x slowdown in the half-precision version of the code. I would double-check the compiler settings (is the code built with full optimization?) and see what the CUDA profiler has to say about bottlenecks in the two versions, which should provide a clue as to what is going on.


Thanks for both replies. I will post a test case shortly.

The code is semi-pseudo code, sorry if I didn’t make it clear enough.

pY is a buffer to an EGL frame with uint8_t luminance data, so there is no chance to change that. In case you were wondering about the purpose of this code: it corrects the reduction in luminance at the image edges caused by a poor-quality camera lens.

Thanks for the f suffix, that saved me 0.4ms in the non-half version of the code.

That is a rather small reduction in runtime given that this occurred in the innermost loop. This would seem to indicate that the performance of this code is mostly limited by memory accesses. I would strongly suggest using the CUDA profiler to determine the bottleneck(s) in this kernel with certainty.


You know, I would love to use the profiler, but it’s not supported on the Jetson Nano. I have made a bare-minimum example below. Just my luck that the example does not show halfs being dramatically slower than floats/ints the way they are in my application. :-(

BTW I can verify visually with the output video that the results from each kernel are very similar. I have double checked that the example code matches that in my application.

I should point out that, in reality, pY is an EGL frame from the nvivafilter gstreamer plugin with the args “cuda-process=true”, so the memory layout is probably different from simply using cudaMalloc. (The nvivafilter plugin provides an EGLImageKHR object from which the raw data is obtained using cuGraphicsEGLRegisterImage and then cuGraphicsResourceGetMappedEglFrame.)
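For reference, the mapping path looks roughly like this (a sketch with error checking omitted; eglImage stands for the EGLImageKHR handed over by the plugin):

#include <cuda.h>
#include <cudaEGL.h>

CUgraphicsResource resource;
CUeglFrame eglFrame;

// Register the EGL image with CUDA and retrieve a descriptor of the frame.
cuGraphicsEGLRegisterImage(&resource, eglImage, CU_GRAPHICS_MAP_RESOURCE_FLAGS_NONE);
cuGraphicsResourceGetMappedEglFrame(&eglFrame, resource, 0, 0);

// Plane 0 of the pitched frame holds the uint8_t luminance data.
uint8_t* pY = (uint8_t*)eglFrame.frame.pPitch[0];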

EDIT: I am aware of the benefit of using half2; however, as it results in less readable code, I am working with halfs in my application until I get the kernel running faster than floats/ints.

nano@jetson-nano:/home/nano$ /usr/local/cuda-10.2/bin/nvcc -gencode arch=compute_53,code=sm_53 forum.cu 
nano@jetson-nano:/home/nano$ ./a.out 
int/float took 3.5ms.
half took 2.8ms.

And the code:

#include <cuda.h>
#include <cuda_fp16.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>

const int imageWidth = 1280;
const int imageHeight = 720;

// The "image" data pY is split into rectangles of the following dimension, with each rectangle processed by its own CUDA thread.
// There are 320 rectangles width-wise and 6 rectangles height-wise to perfectly cover the image area.
const int threadWidth = 4;
const int threadHeight = 120;

#define SQUARE(A)       ((A) * (A))

uint64_t get_time_usec(void)
{
	static struct timeval _time_stamp;
	gettimeofday(&_time_stamp, NULL);
	return (_time_stamp.tv_sec * 1000000ll) + _time_stamp.tv_usec;
}


__global__ void kernel1(uint8_t* pY)
{
    int cudaRow = blockIdx.y * blockDim.y + threadIdx.y;
    int pixelRow = cudaRow * threadHeight;
    int cudaCol = blockIdx.x * blockDim.x + threadIdx.x;
    int pixelCol = cudaCol * threadWidth;
    
    int xConst = (imageWidth / 2) - pixelCol;
    int yConst = (imageHeight / 2) - pixelRow;
    
    for(int x_offset = 0; x_offset < threadWidth; x_offset++)
    {
        int xSq = SQUARE(xConst - x_offset);
        
        for(int y_offset = 0; y_offset < threadHeight; y_offset++)
        {
            uint32_t offset = (pixelRow + y_offset) * imageWidth + pixelCol + x_offset;
            
            int ySq = SQUARE(yConst - y_offset);
            int sumSq = xSq + ySq;
            float distanceFromCentre = sqrtf(sumSq);
            float correction = (1.0749947E-6f * sumSq) - (0.000297173f * distanceFromCentre) + 1.01820957f;
            float pixelVal = (float)pY[offset] * correction;
            
            pY[offset] = (uint8_t)pixelVal;
        }
    }
}


__global__ void kernel2(uint8_t* pY)
{
    int cudaRow = blockIdx.y * blockDim.y + threadIdx.y;
    int pixelRow = cudaRow * threadHeight;
    int cudaCol = blockIdx.x * blockDim.x + threadIdx.x;
    int pixelCol = cudaCol * threadWidth;
    
    // mult is necessary: without it, squaring the coordinate differences below
    // (up to 640) would overflow half, whose maximum finite value is 65504.
    half mult = 0.1;
    half a = __float2half(1.0749947E-4);
    half b = __float2half(-0.00297173);
    half c = __float2half(1.01820957);
    half xConst = __short2half_ru((imageWidth / 2) - pixelCol) * mult;
    half yConst = __short2half_ru((int)(imageHeight / 2) - pixelRow) * mult;

    for(int x_offset = 0; x_offset < threadWidth; x_offset++)
    {
        half xTemp = __hfma(__short2half_ru(x_offset), mult, -xConst);
        half xSq = SQUARE(xTemp);
        
        for(int y_offset = 0; y_offset < threadHeight; y_offset++)
        {
            uint32_t offset = (pixelRow + y_offset) * imageWidth + pixelCol + x_offset;
            
            // Use a quadratic equation to adjust the value in pY.
            half yTemp = __hfma(__short2half_ru(y_offset), mult, -yConst);   
            half sumSq = __hfma(yTemp, yTemp, xSq);                          
            half distanceFromCentre = hsqrt(sumSq);                          
            half correction = (a * sumSq) + __hfma(b, distanceFromCentre, c);
            half pixelVal = __short2half_ru(pY[offset]) * correction;        
            
            pY[offset] = __half2short_ru(pixelVal);
        }
    }
}


int main(void)
{
    dim3 blocks = 2; // max on Jetson Nano
    dim3 threadsPerBlock = dim3(160, 6);    // 960 threads per block - max is 1024 on Jetson Nano
    uint8_t* pY;
    
    cudaMalloc(&pY, imageWidth * imageHeight * sizeof(uint8_t));
    
    uint64_t start = get_time_usec();
    
    kernel1<<<blocks, threadsPerBlock>>>(pY);
    cudaDeviceSynchronize();
    
    printf("int/float took %.1fms.\n", (double)(get_time_usec() - start) / 1000.0);
    
    start = get_time_usec();
    
    kernel2<<<blocks, threadsPerBlock>>>(pY);
    cudaDeviceSynchronize();
    
     printf("half took %.1fms.\n", (double)(get_time_usec() - start) / 1000.0);
     
     return 0;
}

Many thanks.

Even the simple command-line profiler nvprof does not work on the Jetson Nano? If memory serves, nvprof was still included in CUDA 10.2.
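If it is present, a minimal invocation would be (using the path from your earlier output):

/usr/local/cuda-10.2/bin/nvprof ./a.out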

I know next to nothing about NVIDIA’s embedded products, which is why I usually send forum participants with Jetson issues to the designated sub-forums for these platforms.

I took a quick look at the disassembly for kernel1 and kernel2, but nothing jumped out at me that would explain why kernel2 would be slower than kernel1, let alone by a large factor.

I notice belatedly that according to the latest update, the half version is in fact faster than the float version, as expected:

int/float took 3.5ms.
half took 2.8ms.

So it looks like we are all done?

If there are significant performance differences between this standalone version and the full application, there must be a logical explanation. Are there differences in compiler settings, for example? Are you using separate compilation for the full app? Are the grid and block configurations identical? Could there be issues with the timing methodology that causes time to be attributed to the wrong code (e.g. accidentally included initialization overhead or copy overhead with the kernel)?

Are the image dimensions constant? If it is always 4 consecutive bytes per thread, and the pY pointer is aligned to at least 4 bytes, I would suggest loading and storing 4 bytes via vectorized loads.

For example like this:

__global__ void kernel3(uint8_t* pY)
{
    int cudaRow = blockIdx.y * blockDim.y + threadIdx.y;
    int pixelRow = cudaRow * threadHeight;
    int cudaCol = blockIdx.x * blockDim.x + threadIdx.x;
    int pixelCol = cudaCol * threadWidth;
    
    int xConst = (imageWidth / 2) - pixelCol;
    int yConst = (imageHeight / 2) - pixelRow;

    //assuming threadWidth == 4;

    for(int y_offset = 0; y_offset < threadHeight; y_offset++)
    {
        uint32_t offset = (pixelRow + y_offset) * imageWidth + pixelCol;
        uint8_t myPY[4];
        *((char4*)(&myPY[0])) = *((const char4*)(&pY[offset]));

        for(int x = 0; x < 4; x++){
          int xSq = SQUARE(xConst - x);
          int ySq = SQUARE(yConst - y_offset);
          int sumSq = xSq + ySq;
          float distanceFromCentre = sqrtf(sumSq);
          float correction = (1.0749947E-6f * sumSq) - (0.000297173f * distanceFromCentre) + 1.01820957f;
          float pixelVal = (float)myPY[x] * correction;
          myPY[x] = (uint8_t)pixelVal;
        }
        *((char4*)(&pY[offset])) = *((const char4*)(&myPY[0]));
        
        
    }
}

When executed on an RTX 3090, I measured the following timings using the cudaEvent API.

kernel1 (float) elapsed: 0.108544
kernel2 (half) elapsed: 0.094432
kernel3 (float with 4byte loads) elapsed: 0.066464

Improving the memory access appears to be more important than faster computations.
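For reference, the timing was done roughly like this (standard cudaEvent runtime API calls, error checking omitted):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
kernel3<<<blocks, threadsPerBlock>>>(pY);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // milliseconds between the two events
printf("kernel3 (float with 4byte loads) elapsed: %f\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);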


I checked the compiler settings and they are identical. The CUDA kernel is compiled into a .so file so it can be used with the gstreamer nvivafilter plugin. Grid and block dimensions are identical in both cases.

I’m going to check code outside the kernels to see if there is a difference there. There must be.

Sorry I got it wrong; only unified memory profiling is not supported.

(Reposted with better patch)

Thanks for this suggestion to use 4 byte vectorised loads. I tried it and found it saved 1.0ms in the sample code although in my application code it only saves 0.1-0.2ms.

int/float took 3.7ms.
half took 3.0ms.
vectorised loads took 2.7ms.

I ran nvprof on the sample code including vectorised loads and it worked, with the output closely reflecting the printf timing of the sample code. i.e. nvprof looks good. Sadly it does not run with my application and gives this error at the start of the nvprof output:

Invalid MIT-MAGIC-COOKIE-1 keynvbuf_utils: Could not get EGL display connection

This is related to a different camera capture system that works fine when I don’t use nvprof (it’s from the 12_camera_v4l2_cuda L4T demo collection, in case you wondered). I’m not too enthused to look into this unless it is likely to pin down the problem I have…

I made a final sanity check to understand why things were slower with halfs. I built my application with the ints/floats kernel (no vectorised load, for simplicity) and saw a kernel duration of ~3ms. I then applied the following patch (built from clean) to make the kernel use halfs, and saw a massive increase in kernel duration, to 32ms.

I can say with 100% certainty, hand on heart, that this diff represents the only changes involved in using halfs in my application, and something in these changes, or some side effect of them, is increasing the kernel duration from 3ms to 32ms. The diff may look slightly different to the sample code, but it’s doing pretty much the same thing: a quadratic equation to create a number by which to multiply a uint8_t value.

(Makefiles/compiler args are identical too)

diff --git a/samples/nvsample_cudaprocess.cu b/samples/nvsample_cudaprocess.cu
index 8124e07..603c3d0 100755
--- a/samples/nvsample_cudaprocess.cu
+++ b/samples/nvsample_cudaprocess.cu
@@ -32,6 +32,7 @@
 #include <stdlib.h>
 #include <sys/time.h>
 #include <cuda.h>
+#include <cuda_fp16.h>
 #include <pthread.h> 
 #include <list>
 
@@ -361,24 +362,32 @@ __global__ void ProcessFrame(const CudaFrameInfo cu)
     int pixelCol = cudaCol * cu.rectWidth;
     int x_offset, y_offset;
     uint32_t offset;
+
+    // mult is necessary otherwise when the numbers are squared later on to calculate distanceFromCentre there will be an overflow on half.
+    half mult = 0.1;
+    half xConst = __short2half_ru((cu.fbWidth / 2) - pixelCol) * mult;
+    half yConst = __short2half_ru((int)(cu.cudaResY / 2) - pixelRow) * mult;
     
-    int xConst = (cu.fbWidth / 2) - pixelCol;
-    int yConst = (cu.cudaResY / 2) - pixelRow;
-    
+    // The quadratic multipliers from the line of best fit, taking into account mult.
+    half a = __float2half(1.0749947E-4);
+    half b = __float2half(-0.00297173);
+    half c = __float2half(1.01820957);
+
     for(x_offset = 0; x_offset < cu.rectWidth; x_offset++)
 	{
-        int xSq = SQUARE(xConst - x_offset);
+        half xTemp = __hfma(__short2half_ru(x_offset), mult, -xConst);
+        half xSq = SQUARE(xTemp);
         
 		for(y_offset = 0; y_offset < cu.rectHeight; y_offset++)
 		{
-            int ySq = SQUARE(yConst - y_offset);
-            
             offset = (pixelRow + y_offset) * cu.fbWidth + pixelCol + x_offset;
             
-            int sumSq = xSq + ySq;
-            float distanceFromCentre = sqrtf(sumSq);
-            float correction = (1.0749947E-6f * sumSq) - (0.000297173f * distanceFromCentre) + 1.01820957f;
-            float pixelVal = (float)pY[offset] * correction;
+            // Using an equation, correct for the darkening of the image as you go further from the centre.
+            half yTemp = __hfma(__short2half_ru(y_offset), mult, -yConst);
+            half sumSq = __hfma(yTemp, yTemp, xSq);
+            half distanceFromCentre = hsqrt(sumSq);
+            half correction = (a * sumSq) + __hfma(b, distanceFromCentre, c);
+            half pixelVal = __short2half_ru(pY[offset]) * correction;
         }
     }

The truth is somewhere in this diff!

I highly recommend reviewing the SASS code.

I would also remove the -maxrregcount=32. This is likely hampering the optimization.
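For example, to dump the SASS from the executable built earlier (cuobjdump ships with the toolkit):

/usr/local/cuda-10.2/bin/cuobjdump -sass a.out > kernel_sass.txt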


@Greg raises an excellent point, one that I overlooked when I inspected the generated SASS code earlier (meaning I compiled without -maxrregcount=32).

Generally speaking, -maxrregcount and __launch_bounds() were mechanisms introduced into CUDA for coping with register-starved GPU architectures and immature compilers in the early half dozen years of CUDA’s existence.

For the past decade, GPUs have not been register starved and the CUDA toolchain has been based on LLVM technology, making the use of both -maxrregcount and __launch_bounds() not only not necessary but frequently counterproductive. There can be rare exceptions to this rule of thumb but I have not personally come across any.

The heuristics of the CUDA compiler are well adjusted for near optimal performance, with occupancy balanced with optimizations that tend to increase register pressure.
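If you want to see what register budget the compiler picks on its own, ptxas can report per-kernel register usage, for example:

/usr/local/cuda-10.2/bin/nvcc -gencode arch=compute_53,code=sm_53 -Xptxas -v forum.cu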


Hi there, unfortunately I have found the “-maxrregcount=32” compiler argument to be vital for a different kernel, where not using it resulted in strange behaviour: Weird CUDA problem: changing += to /= in a loop causes a variable not to be set - #11 by AastaLLL
I tried compiling without “-maxrregcount=32” anyway, and while this particular kernel ran fine, there was no speed-up.

I would like to mention something in case it matters. The diff above still stands; applying that diff and only that diff causes the increased delay, but in order to get the code with halfs compiling I had to make sure I was not compiling for older architectures. Originally I had these in my nvcc args:

GENCODE_SM30    := -gencode arch=compute_30,code=sm_30
GENCODE_SM32    := -gencode arch=compute_32,code=sm_32
GENCODE_SM35    := -gencode arch=compute_35,code=sm_35
GENCODE_SM50    := -gencode arch=compute_50,code=sm_50
GENCODE_SMXX    := -gencode arch=compute_50,code=compute_50
GENCODE_SM53    := -gencode arch=compute_53,code=sm_53 -maxrregcount=32
GENCODE_SM62    := -gencode arch=compute_62,code=sm_62
GENCODE_SM72    := -gencode arch=compute_72,code=sm_72
GENCODE_SM_PTX  := -gencode arch=compute_72,code=compute_72

This caused compilation to fail, presumably because older architectures do not support halfs. This is the error I saw:

nvsample_cudaprocess.cu(368): error: more than one conversion function from "__half" to a built-in type applies:
            function "__half::operator float() const"
            function "__half::operator short() const"
            function "__half::operator unsigned short() const"
            function "__half::operator int() const"
            function "__half::operator unsigned int() const"
            function "__half::operator long long() const"
            function "__half::operator unsigned long long() const"
            function "__half::operator __nv_bool() const"

As for the SASS code, I dumped it, but the assembly is 2000 lines long for the half implementation and 1000 lines long for floats/ints. I’ve attached both in case they’re of help. These are the CUDA objdumps of the application version of the function before and after applying the patch I posted above.

kernel_floatint.txt (85.5 KB)
kernel_half.txt (183.5 KB)

Cheers.


I think I need to apologise… it turns out I was building my application with “-G” passed to nvcc. I am now seeing similar execution times with non-vectorised half as with vectorised float/int. I’ve learnt a lot from this thread but the problem I thought existed only exists when “-G” is enabled. Sorry for not making this clear earlier.

As for why “-G” causes half code to be massively slower than float/int… I think it would still be interesting to hear the answer to this question but it’s not vital for me to continue. Here is the result of “-G” on the sample code:

nano@jetson-nano:/home/nano$ /usr/local/cuda-10.2/bin/nvcc -G -gencode arch=compute_53,code=sm_53 forum.cu 
nano@jetson-nano:/home/nano$ ./a.out 
int/float took 28.1ms.
half took 164.1ms.
vectorised loads took 8.4ms.

In debug builds, all optimizations are disabled, and the compiler may even apply what I like to call “pessimizations”. This is done to allow the trackability of all variables at all times and to provide for precise breakpoints relative to source code.

If you look at the machine code generated by a release build, you will find: (1) Code from a single line of source code may be splintered into isolated instructions loosely mixed with dozens of other instructions. (2) Code from some source lines may have disappeared altogether. (3) Code from various inlined functions is intensely mixed up with the calling code. (4) The same variable may be held in different registers during its lifetime. (5) Some variables from the source code may have been eliminated, while (6) the compiler has created a bunch of new variables that are not there in the source code.

Debug builds should never be used for benchmarking or any other kind of performance work.
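For performance work, build without -G (device code optimizations are on by default); if you need source-line correlation in the profiler, -lineinfo is the lightweight alternative, for example:

/usr/local/cuda-10.2/bin/nvcc -lineinfo -gencode arch=compute_53,code=sm_53 forum.cu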
