I have a fairly large CUDA application (thousands of lines).
After months of development, I noticed that its performance has dropped: it is now up to 30% slower.
The nvcc output shows register spills: one function has 2000 bytes of load spills, and many functions have up to 200 bytes of load/store spills.
Is there a technique, a paper, or a tool that can tell me which lines of my code are causing the spills?
Even though I have read some GPGPU and CUDA books, I still have absolutely no idea how to deal with this problem. I need to understand it better so that I can find the cause and know how to fix it.
I have tried almost everything I found on the internet: `-maxrregcount`, `__launch_bounds__`, inline and noinline, volatile, restrict, bitfields instead of bools/ints, C-style flags, all of them in every combination I could think of.
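For illustration only, here is a minimal sketch of how a few of these knobs are typically applied. It is not the real application: the kernel `scale_kernel`, the helper `scale_one`, the file name `demo.cu`, and the sizes are all made up. A kernel this small will not spill; the point is only to show where the launch-bounds qualifier (spelled `__launch_bounds__` in CUDA), `__restrict__`, and `__noinline__` go, and which nvcc options print the per-kernel register and spill statistics.

```cuda
// Build (demo.cu is an assumed file name):
//   nvcc -Xptxas -v demo.cu -o demo                    // prints registers used and spill loads/stores per kernel
//   nvcc -Xptxas -v -maxrregcount=32 demo.cu -o demo   // same, with a per-thread register cap for the whole file
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper. __noinline__ asks the compiler to keep this as a real
// call instead of inlining it into the kernel; __forceinline__ asks the opposite.
__device__ __noinline__ float scale_one(float x, float k)
{
    return k * x;
}

// Hypothetical kernel. __launch_bounds__(256) promises that the kernel is never
// launched with more than 256 threads per block, which the compiler takes into
// account when deciding how many registers each thread may use.
// __restrict__ promises that `in` and `out` do not alias.
__global__ void __launch_bounds__(256)
scale_kernel(const float* __restrict__ in, float* __restrict__ out, float k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = scale_one(in[i], k);
    }
}

// Runs the kernel once and returns the number of wrong elements,
// or -1 if a CUDA call failed.
int run_demo()
{
    const int n = 1 << 16;
    float* in = nullptr;
    float* out = nullptr;
    if (cudaMallocManaged(&in, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMallocManaged(&out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(in);
        return -1;
    }
    for (int i = 0; i < n; ++i) in[i] = static_cast<float>(i);

    const int threads = 256;  // must not exceed the launch bound above
    const int blocks = (n + threads - 1) / threads;
    scale_kernel<<<blocks, threads>>>(in, out, 2.0f, n);

    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    int mismatches = 0;
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        mismatches = -1;
    } else {
        for (int i = 0; i < n; ++i) {
            if (out[i] != 2.0f * in[i]) ++mismatches;
        }
    }
    cudaFree(in);
    cudaFree(out);
    return mismatches;
}

int main()
{
    int bad = run_demo();
    std::printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

Note that `-maxrregcount` applies to every kernel in the compiled file, while `__launch_bounds__` is set per kernel, so the two can pull the register allocation in different directions when combined.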
When the app was smaller, that seemed to help, but not anymore.
Besides, these methods feel like pure magic: does one really have to try them in every possible configuration, compare the results, and see what works best?
Any help will be greatly appreciated.