I am relatively new to CUDA programming. I am transitioning an existing code base to run on GPUs. While tuning performance, I noticed that compiling device functions separately from the source file that launches the kernel has a tremendous performance cost (30 - 40%).
These functions are large (2000+ lines of legacy code each) and they were originally compiled in their own source files. If I instead include these functions in the same source file that launches the kernel, and compile one big object file, I see far better performance. However, this is horrible for code organization.
NVPROF indicates that l2_local_load_bytes and l2_local_load_requests are greatly reduced when these functions are included in a single build. Further, the local_hit_ratio also improves significantly. However, the register count is not always affected, as indicated by printing the GPU trace with NVPROF.
Since these functions were huge, I do not expect them to be inlined, and yet, somehow the register pressure is reduced.
I was wondering if there are flags I could pass to the compilation/linking phase that would allow me to get the same performance benefits without having to have one giant object file.
Separately compiled object code requires functions to make use of the ABI, which implies restrictions on register usage and overhead from additional instructions for call, return, manipulating the stack frame. Also, very importantly, optimizations across compilation units cannot be performed unless the linker provides link-time optimization. These effects occur with all processor platforms, CPU or GPU.
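To get a feel for the call overhead part of this, here is a minimal sketch. It does not reproduce separate compilation itself; it only asks the compiler to keep a real call inside a single file, so the ABI cost can be looked at in isolation. The names `legacy_update`, `run_step`, and the macro `FORCE_CALL` are made up for illustration, and `__noinline__` is only a hint that the compiler may or may not honor.

```cuda
// Hypothetical build lines (file name sketch.cu is an assumption):
//   nvcc -O3 -Xptxas -v sketch.cu -o sketch               (compiler free to inline)
//   nvcc -O3 -Xptxas -v -DFORCE_CALL sketch.cu -o sketch  (ask for a real call)
// -Xptxas -v prints register use, stack frame size, and spill loads/stores
// per kernel, which can be compared between the two builds.
#include <cstdio>
#include <cuda_runtime.h>

#ifdef FORCE_CALL
#define CALL_ATTR __noinline__   // hint: keep this as an out-of-line call
#else
#define CALL_ATTR                // default: leave the decision to the compiler
#endif

// Tiny stand-in for a large legacy routine (hypothetical).
CALL_ATTR __device__ float legacy_update(float x)
{
    return 2.0f * x + 1.0f;
}

__global__ void run_step(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = legacy_update(data[i]);
    }
}

int main()
{
    const int n = 256;
    float host[n];
    for (int i = 0; i < n; ++i) {
        host[i] = static_cast<float>(i);
    }

    float *dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    run_step<<<(n + 127) / 128, 128>>>(dev, n);

    // The blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(host, dev, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("host[3] = %f\n", host[3]);
    return 0;
}
```

With a function this small the two builds may well report the same numbers; the point is only the workflow of comparing the ptxas statistics, which is also a way to check for the extra local-memory traffic described in the original post.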
Contrary to some CPU tool chains, currently available versions of CUDA do not ship with an optimizing linker. However, some link-time optimizations will apparently be supported in CUDA 11, per the following announcement:
The extent of link-time optimization capabilities in CUDA 11 is not known at this time. In the best case, separately compiled compilation units in conjunction with an optimizing linker would provide performance that is identical to a single compilation comprising the entire code. I consider it unlikely that that will be achieved with the very first release of an optimizing device-code linker, but one can hope.
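For readers on a newer toolkit: as far as I understand the nvcc documentation for CUDA 11.2 and later, device link-time optimization is requested with `-dlto` at both the compile and the link step. The sketch below is written under that assumption. The file names, the architecture `sm_70`, and the function names are placeholders, and whether LTO recovers the full whole-program performance for a given code base has to be measured.

```cuda
// Two-file layout shown in one listing. It also compiles as a single file,
// which corresponds to the "one big object file" build from the question.
//
// Assumed build lines with device LTO (CUDA 11.2 or newer):
//   nvcc -arch=sm_70 -dc -dlto legacy.cu main.cu
//   nvcc -arch=sm_70 -dlto legacy.o main.o -o app
// Plain separate compilation, for comparison:
//   nvcc -arch=sm_70 -dc legacy.cu main.cu
//   nvcc -arch=sm_70 legacy.o main.o -o app

// ---- legacy.cu (hypothetical file name) ----
// Stand-in for a large legacy routine kept in its own source file.
__device__ float legacy_update(float x)
{
    return 2.0f * x + 1.0f;
}

// ---- main.cu (hypothetical file name) ----
#include <cstdio>
#include <cuda_runtime.h>

// Only the declaration is visible here when the files are built separately.
extern __device__ float legacy_update(float x);

__global__ void run_step(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = legacy_update(data[i]);
    }
}

int main()
{
    const int n = 256;
    float *dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(dev, 0, n * sizeof(float));

    run_step<<<(n + 127) / 128, 128>>>(dev, n);

    float host[n];
    cudaError_t err = cudaMemcpy(host, dev, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("host[0] = %f\n", host[0]);
    return 0;
}
```

The appeal of this layout is that the legacy routines stay in their own source files, and only the build flags change between the separately compiled and the link-time-optimized variant.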
@njuffa, thanks for your response, that makes sense. I wish the seriousness of the potential performance hit was better documented. For example, to quote the caveat section of this post:
“Performance of linked device code may also be a bit lower than the performance of device code built with full code path visibility at compile time”
Well, a 30 - 40 % performance hit is enormous! We spend years trying to get that kind of performance on our CPU code.
I discovered this issue by accident while performance tuning. When I commented out all code related to memory loads, set some defaults, and performed only a bit of arithmetic, the overall cost of the function remained mostly unchanged. I spent a lot of time figuring out what was wrong. Only when I compared it against another function that had a similar amount of logic, but was built together with the kernel, did I begin to suspect that the problem was that this function was not part of the kernel build.
I have no knowledge of the nature of your code other than the superficial description provided in the original post. Other effects than those directly due to separate compilation may be at play. It is possible that due to the nature of your code it is affected by the difference between separate compilation and whole-code compilation more than other codes.
In my experience working with the Intel tool chain for many years, the performance difference between compiling with and without /Qipo (cross-file interprocedural optimization) can easily be 20%. So in general, link-time optimization is a powerful performance-boosting tool.