Have you checked whether there is a clear correlation by repeating the measurements? Compile at -O2 and profile three times. Compile at -O3 and profile three times. Then go back to -O2 and profile three more times.
Other than changing the optimization level of the build, have any other changes been made to the system (hardware or software)? Is it possible that there are other applications running on the system while you profile, either run by you, or by another user?
My working hypothesis here is that what you observe are either artifacts created by sub-optimal measurement methodology, or distortions of the results due to outside factors. As I stated earlier, there is no causal relationship I can envision between the optimization level of the host portion of your code and the speed at which the CUDA software components do their work. You are measuring times in the microsecond range, so some fluctuation (“noise”) and random artifacts are likely, which is why I suggest repeating the measurements.
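One way to evaluate such repeated measurements is sketched below. This is only a rough illustration, not a rigorous statistical test: it compares the gap between the -O3 runs and the -O2 runs against both the run-to-run scatter and the drift between the two -O2 sessions. The function names, the factor of 2 used as a threshold, and the sample numbers are all assumptions made for the example; substitute the kernel times your profiler actually reports.

```python
from statistics import mean, stdev


def summarize(samples):
    """Return (mean, sample standard deviation) of a list of timings."""
    return mean(samples), stdev(samples)


def aba_check(o2_first, o3, o2_second, factor=2.0):
    """Compare the -O2 vs -O3 gap with the noise seen in the measurements.

    o2_first, o3, o2_second: timings (same unit, e.g. microseconds) from the
    three profiling sessions, each with at least two samples.
    factor: hypothetical threshold; 2.0 is an arbitrary choice for this sketch.
    """
    mean_a1, sd_a1 = summarize(o2_first)
    mean_b, sd_b = summarize(o3)
    mean_a2, sd_a2 = summarize(o2_second)

    baseline = mean([mean_a1, mean_a2])   # average of both -O2 sessions
    effect = abs(mean_b - baseline)       # apparent -O2 vs -O3 difference
    drift = abs(mean_a1 - mean_a2)        # how much -O2 differs from itself
    noise = max(sd_a1, sd_b, sd_a2)       # worst run-to-run scatter

    return {
        "effect": effect,
        "drift": drift,
        "noise": noise,
        # Only treat the difference as real if it clearly exceeds both
        # the session-to-session drift and the run-to-run scatter.
        "looks_real": effect > factor * max(drift, noise),
    }


if __name__ == "__main__":
    # Placeholder values for illustration only, not real measurements.
    result = aba_check([10.0, 10.2, 9.8], [10.1, 9.9, 10.3], [10.1, 10.0, 9.9])
    print(result)
```

If the second -O2 session does not reproduce the first one, or if the apparent difference is no larger than the scatter within a session, the optimization level is unlikely to be the cause.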
I have never used a TX2. There may be specific caveats with respect to profiling on such a system that I am not aware of, so you may also want to ask in the dedicated TX2 forum: https://devtalk.nvidia.com/default/board/188/jetson-tx2/