Thanks for getting back to me, Mat!
I was able to get the faster runtimes in my MRE with the -Minline flag, which is nice. However, I have not been able to get the subroutine call in my main code base to inline, even when I pass the -Minline flag, turn the optimization up to -O4, and simplify the logic of the subroutine. The strange part is that -Minfo=inline does not give me any reason why it is not inlining, as it does for some other subroutines in my code base. I haven’t been able to replicate this behavior with my MRE yet, but do you have any ideas as to what may be causing this?
Regarding register usage: this does seem to be an issue with the subroutine in our main code base, at least based on some of the output from Nsight Compute. The report suggests that in the lead-up to the subroutine we use significantly more registers, and the average number of threads we run per warp goes down to 1. Is it fair to assume the decrease in threads is due to register pressure, or could there be other causes?
