Float to Half Performance Improvement

In an effort to improve processing time, I recently converted one of my CUDA programs from 32-bit floats to 16-bit half-precision floats. I am using a Jetson AGX Xavier, which should process half precision twice as fast as single precision. This change did not make a significant difference in processing time. Using the Nsight Systems profiler, I saw that the majority of the program’s time was spent in cudaMalloc and cudaFree, and that the same percentage of time was spent in these API calls in both versions. I assumed that cudaMalloc and cudaFree would be twice as fast for the half-precision version, since it only deals with half the total bytes. Is this lack of performance difference between half and float inherent to CUDA, or does it indicate an issue with my program?

To first order, the speed of cudaMalloc() and cudaFree() is independent of the size of the allocation, just as the speed of regular malloc() and free() is independent of the size of the allocation. These APIs manipulate internal control structures; they do not operate on the memory being allocated. This work is performed by the driver on the host side, so using a CPU with higher single-thread performance (and possibly faster system memory) will help.
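To see this for yourself, here is a minimal sketch (my code, not from this thread; it assumes the CUDA runtime API and host-side timing with std::chrono) that times cudaMalloc()/cudaFree() pairs across a range of sizes:

```
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main()
{
    cudaFree(0);  // trigger context creation first (see the note at the end of the thread)
    for (size_t bytes = 1 << 10; bytes <= (size_t)1 << 28; bytes <<= 6) {
        void *p = nullptr;
        auto t0 = std::chrono::steady_clock::now();
        cudaMalloc(&p, bytes);
        cudaFree(p);
        auto t1 = std::chrono::steady_clock::now();
        std::chrono::duration<double, std::micro> us = t1 - t0;
        printf("%10zu bytes: %8.1f us per malloc+free\n", bytes, us.count());
    }
    return 0;
}
```

To first order, the reported times should be similar across sizes even though the sizes span many orders of magnitude, matching the explanation above.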

Use of 16-bit half-precision instead of 32-bit single-precision will benefit performance if an application is bound by memory throughput. If you use half2 data, applications limited by computational throughput can also benefit.
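As an illustration of the half2 point, here is a minimal device-side sketch (my example, not code from the question) that packs two half values per thread using the half2 vector type and the __hadd2 intrinsic from <cuda_fp16.h>; it needs a GPU with native half arithmetic (sm_53 or newer, which includes the Xavier’s Volta GPU):

```
#include <cuda_fp16.h>

// Element-wise addition of two half2 arrays: each thread performs two
// half-precision additions with a single __hadd2 instruction. n2 is the
// length in half2 elements, i.e. half the number of half values.
__global__ void add_half2(const half2 *a, const half2 *b, half2 *c, int n2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2) {
        c[i] = __hadd2(a[i], b[i]);
    }
}
```

This is what lets compute-bound code benefit: the hardware issues one instruction for two half operations.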

Thank you for the reply, njuffa.

Seeing that using half won’t necessarily speed up the memory allocation, is there any way to pre-allocate GPU memory so that it doesn’t have to be allocated dynamically with cudaMalloc? Or even to manually provide addresses to cudaMalloc so it doesn’t have to search for free memory?

The way you could do that is to call cudaMalloc() once at the start of the program and then re-use that allocation throughout. You could even create your own little sub-allocator on top of it if you like. As long as there aren’t many memory blocks to be tracked, dynamic memory allocators, including cudaMalloc(), are typically fast, as there isn’t much searching for free blocks.
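A hypothetical sketch of that idea (the pool_* names are mine, not a CUDA API): allocate one large block up front, then hand out aligned sub-ranges with a trivial bump allocator and release everything at once:

```
#include <cuda_runtime.h>
#include <cstddef>

static char  *g_pool      = nullptr;
static size_t g_pool_size = 0;
static size_t g_offset    = 0;

// One cudaMalloc() for the lifetime of the program.
cudaError_t pool_init(size_t bytes)
{
    g_pool_size = bytes;
    g_offset    = 0;
    return cudaMalloc((void **)&g_pool, bytes);
}

// Bump allocation: no searching, just advance an offset (256-byte aligned).
void *pool_alloc(size_t bytes)
{
    size_t aligned = (bytes + 255) & ~(size_t)255;
    if (g_offset + aligned > g_pool_size) return nullptr;
    void *p = g_pool + g_offset;
    g_offset += aligned;
    return p;
}

void pool_reset(void)   { g_offset = 0; }  // "frees" all sub-allocations at once
void pool_destroy(void) { cudaFree(g_pool); g_pool = nullptr; }
```

A bump allocator only works if your allocation lifetimes are simple (everything freed together), but that is often the case for per-frame or per-batch processing.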

Note: CUDA context creation occurs lazily on the first CUDA API call. In many CUDA programs, that is a call to cudaMalloc(). Depending on how you time your code, the overhead of context creation will be attributed to that first cudaMalloc() call. To separate the context creation time from the time taken by cudaMalloc(), a common practice is to trigger it with a call to cudaFree(0);
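For example, a minimal sketch (my code; host timing via std::chrono) that lets cudaFree(0) absorb the context creation so the first cudaMalloc() is timed on its own:

```
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

static double ms_since(std::chrono::steady_clock::time_point t0)
{
    std::chrono::duration<double, std::milli> d =
        std::chrono::steady_clock::now() - t0;
    return d.count();
}

int main()
{
    auto t0 = std::chrono::steady_clock::now();
    cudaFree(0);                  // absorbs lazy context creation
    printf("cudaFree(0):      %.2f ms\n", ms_since(t0));

    void *p = nullptr;
    t0 = std::chrono::steady_clock::now();
    cudaMalloc(&p, 1 << 20);      // 1 MiB; now measures just the allocation
    printf("first cudaMalloc: %.2f ms\n", ms_since(t0));

    cudaFree(p);
    return 0;
}
```

Without the cudaFree(0) line, the context-creation cost (often tens to hundreds of milliseconds) would show up in the cudaMalloc() timing instead.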