For a complete test case, I would like to see complete code (all the host code and device code needed to build a complete application, so that I can copy, paste, compile, and run it without having to add or change anything), along with the platform you are running on (OS, GPU) and the complete compile command.
If any of that is missing, I’m less likely to spend any time on it. For example, I wouldn’t want to waste time analyzing code, only to discover that OP is compiling a debug project instead of a release project, and doing performance analysis on debug code.
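As a rough sketch of the shape I have in mind (not your code): the file name `repro.cu`, the `add` kernel, the `CHECK` macro, and the `-arch=sm_80` value are all placeholders I made up for illustration. Substitute your own kernel and the architecture of your GPU.

```cuda
// repro.cu (hypothetical file name): a self-contained test case.
// Example release build (note: no -G); change -arch to match your GPU:
//   nvcc -O3 -arch=sm_80 -o repro repro.cu
// State the platform in the post as well: OS, GPU model, CUDA version.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Report a failed CUDA call and make the enclosing function return 1.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Placeholder kernel: substitute the kernel being asked about.
__global__ void add(const float *a, const float *b, float *c, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Allocates, runs, and verifies; returns 0 on success, 1 on failure.
int run_test()
{
    const size_t n = (size_t)1 << 24;
    const size_t bytes = n * sizeof(float);
    std::vector<float> ha(n, 1.0f), hb(n, 2.0f), hc(n, 0.0f);

    // Print the GPU name and runtime version so they end up in the output.
    cudaDeviceProp prop;
    int runtime_version = 0;
    CHECK(cudaGetDeviceProperties(&prop, 0));
    CHECK(cudaRuntimeGetVersion(&runtime_version));
    std::printf("GPU: %s, CUDA runtime: %d\n", prop.name, runtime_version);

    float *da = nullptr, *db = nullptr, *dc = nullptr;
    CHECK(cudaMalloc(&da, bytes));
    CHECK(cudaMalloc(&db, bytes));
    CHECK(cudaMalloc(&dc, bytes));
    CHECK(cudaMemcpy(da, ha.data(), bytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(db, hb.data(), bytes, cudaMemcpyHostToDevice));

    const int threads = 256;
    const unsigned blocks = (unsigned)((n + threads - 1) / threads);
    add<<<blocks, threads>>>(da, db, dc, n);
    CHECK(cudaGetLastError());        // catches launch errors
    CHECK(cudaDeviceSynchronize());   // catches execution errors
    CHECK(cudaMemcpy(hc.data(), dc, bytes, cudaMemcpyDeviceToHost));

    // Verify every element against the known answer 1 + 2 = 3.
    for (size_t i = 0; i < n; ++i) {
        if (hc[i] != 3.0f) {
            std::fprintf(stderr, "mismatch at %zu: %f\n", i, hc[i]);
            return 1;
        }
    }
    CHECK(cudaFree(da));
    CHECK(cudaFree(db));
    CHECK(cudaFree(dc));
    std::printf("PASSED\n");
    return 0;
}

int main() { return run_test(); }
```

Putting the compile command in a comment at the top of the file is just one convention; the point is that the code, the command, and the platform details all travel together.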
At first glance, your two codes look different to me: in the fast case you load the a and b quantities exactly once and store c exactly once, while in the other case you are (in the source code, at least) loading the a and b quantities multiple times and potentially storing c multiple times. Claiming that the two are equivalent presumes things about the compiler that I’m not sure are always true.
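To make that distinction concrete, here is a minimal sketch of the two source-level patterns as I understand them. The kernel names, the arithmetic in the loop, and the `max_variant_difference` helper are my own inventions, not your code. One thing the compiler generally cannot assume here, absent `__restrict__`, is that `c` does not alias `a` or `b`, so in the second kernel it may have to keep the per-iteration loads and stores.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Variant 1: a[i] and b[i] are loaded once, c[i] is stored once.
__global__ void load_once(const float *a, const float *b, float *c,
                          int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = a[i], y = b[i], acc = 0.0f;   // per-thread local copies
    for (int k = 0; k < iters; ++k) acc = acc * y + x;
    c[i] = acc;
}

// Variant 2: the source reads a[i], b[i], c[i] and writes c[i] on every
// iteration. Whether the compiler collapses this into variant 1 is not
// guaranteed, since c might alias a or b as far as it can tell.
__global__ void load_every_iteration(const float *a, const float *b, float *c,
                                     int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    c[i] = 0.0f;
    for (int k = 0; k < iters; ++k) c[i] = c[i] * b[i] + a[i];
}

static bool ok(cudaError_t e) { return e == cudaSuccess; }

// Runs both variants on the same inputs (separate output buffers, no
// aliasing) and returns the largest absolute difference between their
// results, or -1.0f if any CUDA call fails.
float max_variant_difference(int n, int iters)
{
    std::vector<float> ha(n, 1.0f), hb(n, 0.5f), h1(n), h2(n);
    const size_t bytes = (size_t)n * sizeof(float);
    float *da = nullptr, *db = nullptr, *c1 = nullptr, *c2 = nullptr;

    bool good = ok(cudaMalloc(&da, bytes)) && ok(cudaMalloc(&db, bytes)) &&
                ok(cudaMalloc(&c1, bytes)) && ok(cudaMalloc(&c2, bytes)) &&
                ok(cudaMemcpy(da, ha.data(), bytes, cudaMemcpyHostToDevice)) &&
                ok(cudaMemcpy(db, hb.data(), bytes, cudaMemcpyHostToDevice));
    if (good) {
        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;
        load_once<<<blocks, threads>>>(da, db, c1, n, iters);
        load_every_iteration<<<blocks, threads>>>(da, db, c2, n, iters);
        good = ok(cudaGetLastError()) && ok(cudaDeviceSynchronize()) &&
               ok(cudaMemcpy(h1.data(), c1, bytes, cudaMemcpyDeviceToHost)) &&
               ok(cudaMemcpy(h2.data(), c2, bytes, cudaMemcpyDeviceToHost));
    }
    // cudaFree on a null pointer is a no-op, so this is safe on failure paths.
    cudaFree(da); cudaFree(db); cudaFree(c1); cudaFree(c2);
    if (!good) return -1.0f;

    float worst = 0.0f;
    for (int i = 0; i < n; ++i) worst = fmaxf(worst, fabsf(h1[i] - h2[i]));
    return worst;
}
```

The two variants can produce the same numbers and still generate quite different memory traffic. Comparing the generated SASS for each (for example with `cuobjdump -sass` on the executable) would show what the compiler actually did, rather than what we assume it did.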
I’m also reluctant to analyze artificial code, like the loops of 1000, because I don’t know what the compiler will be doing with them under the hood. To analyze performance, my suggestion would be to work on large data sets rather than artificially multiplying the work by 1000. The compiler might discover things about your loop of 1000 when all the data is per-thread local data that it cannot or does not discover when some of the data is global data.
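For what it’s worth, this is roughly how I would time a kernel over a large data set instead of wrapping the work in an artificial loop. It is only a sketch: the `add` kernel, the `time_add_ms` and `print_scaling` helpers, and the chosen sizes are placeholders of mine, and the inputs are simply zero-filled because only the timing matters here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: one element per thread, all traffic is global memory.
__global__ void add(const float *a, const float *b, float *c, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

static bool ok(cudaError_t e) { return e == cudaSuccess; }

// Times a single launch over n elements with CUDA events.
// Returns elapsed milliseconds, or -1.0f if any CUDA call fails.
float time_add_ms(size_t n)
{
    const size_t bytes = n * sizeof(float);
    float *a = nullptr, *b = nullptr, *c = nullptr;
    cudaEvent_t start = nullptr, stop = nullptr;
    float ms = -1.0f;

    bool good = ok(cudaMalloc(&a, bytes)) && ok(cudaMalloc(&b, bytes)) &&
                ok(cudaMalloc(&c, bytes)) &&
                ok(cudaMemset(a, 0, bytes)) && ok(cudaMemset(b, 0, bytes)) &&
                ok(cudaEventCreate(&start)) && ok(cudaEventCreate(&stop));
    if (good) {
        const int threads = 256;
        const unsigned blocks = (unsigned)((n + threads - 1) / threads);
        add<<<blocks, threads>>>(a, b, c, n);   // warm-up launch, not timed
        good = ok(cudaEventRecord(start));
        add<<<blocks, threads>>>(a, b, c, n);   // timed launch
        good = good && ok(cudaEventRecord(stop)) &&
               ok(cudaEventSynchronize(stop)) && ok(cudaGetLastError()) &&
               ok(cudaEventElapsedTime(&ms, start, stop));
    }
    if (start) cudaEventDestroy(start);
    if (stop) cudaEventDestroy(stop);
    cudaFree(a); cudaFree(b); cudaFree(c);   // no-ops for null pointers
    return good ? ms : -1.0f;
}

// Grows the data set instead of repeating the work on a small one.
void print_scaling()
{
    for (int shift = 20; shift <= 26; shift += 2) {
        const size_t n = (size_t)1 << shift;
        std::printf("n = %zu: %.3f ms\n", n, time_add_ms(n));
    }
}
```

Calling `print_scaling()` from `main` shows how the kernel time changes as the data set grows, which tells you more about real behavior than a loop count the compiler may or may not see through.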