I am relatively new to CUDA programming. I am transitioning an existing code base to run on GPUs. While tuning performance, I noticed that compiling the device functions a kernel calls in a different source file from the one that launches the kernel has a tremendous impact on performance (30-40%).
These functions are large (2000+ lines of legacy code each) and were originally compiled in their own source files. If I instead include these functions in the same translation unit that launches the kernel, producing one big object file, I see much better performance. Unfortunately, this is horrible for code organization.
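For reference, here is roughly the difference between the two builds I am comparing (file and binary names below are placeholders for my actual sources):

```shell
# Original layout (slower): device functions in their own translation
# units, so relocatable device code (-rdc=true) is needed to link them
nvcc -rdc=true -c legacy_funcs.cu -o legacy_funcs.o
nvcc -rdc=true -c kernel_launcher.cu -o kernel_launcher.o
nvcc -rdc=true legacy_funcs.o kernel_launcher.o -o app_separate

# Single-translation-unit layout (faster, but poorly organized):
# kernel_launcher_all.cu simply #includes the legacy function sources
nvcc -c kernel_launcher_all.cu -o kernel_launcher_all.o
nvcc kernel_launcher_all.o -o app_single
```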
nvprof indicates that the number of l2_local_load_bytes and l2_local_load_requests is greatly reduced when these functions are compiled in the same translation unit, and the local hit ratio also improves significantly. However, the register count is not always affected, as shown by printing the GPU trace with nvprof.
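In case it helps, this is roughly how I am collecting those counters (metric names as spelled above; exact names may vary by nvprof version, and the binary names are placeholders):

```shell
# Compare local-memory traffic between the two builds
nvprof --metrics l2_local_load_bytes,l2_local_load_requests ./app_separate
nvprof --metrics l2_local_load_bytes,l2_local_load_requests ./app_single

# Per-launch register usage from the GPU trace
nvprof --print-gpu-trace ./app_separate
```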
Since these functions are huge, I would not expect them to be inlined, and yet somehow the register pressure (or at least the resulting local-memory spilling) is reduced.
I was wondering if there are flags I could pass at the compilation or linking phase that would let me get the same performance benefit without having to build one giant object file.
This is the only relevant topic I was able to find:
Thanks for your patience and advice!