I am not sure what you mean. What other optimizations conflict with --use_fast_math in your use case?
In any event, I mentioned --use_fast_math as one example of a compiler switch that is often relevant to performance. Have you checked for any defines that may differ between your online and offline builds? Are there any differences in the kernel launch parameters? Any differences in the metrics collected via the CUDA profiler? Have you double-checked the robustness of the performance measurement framework?
Without knowing the code, the compiler switches used, and the target GPU, I can only speculate wildly. I assume you are running a controlled experiment in which all hardware and software stay exactly the same, and only the manner of compilation (online vs offline) changes.
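On the question of measurement robustness, here is a minimal sketch of the kind of timing harness I have in mind. It is host-only C++ and not taken from your code: the name median_ms and its default parameters are hypothetical. The idea is to run a few untimed warm-up iterations first, so that one-time costs (which may include JIT compilation or module loading in an online build) do not leak into the result, and then to report the median of several timed runs rather than a single sample.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <vector>

// Hypothetical helper (not from the original post): returns the median
// wall-clock time of `run` in milliseconds.
//
// Assumption: `run` blocks until the work is finished. For a CUDA kernel that
// means synchronizing inside `run` (for example with cudaDeviceSynchronize()
// after the launch); otherwise only the launch overhead would be measured.
double median_ms(const std::function<void()>& run, int warmup = 3, int reps = 11)
{
    assert(reps > 0);

    // Untimed warm-up runs absorb one-time costs such as lazy initialization.
    for (int i = 0; i < warmup; ++i) {
        run();
    }

    std::vector<double> samples;
    samples.reserve(reps);
    for (int i = 0; i < reps; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        run();
        auto t1 = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::milli>(t1 - t0).count());
    }

    // The median is less sensitive to occasional outliers than the mean.
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}
```

You would call this once with a callable that launches the offline-compiled kernel and once with one that launches the online-compiled kernel, using identical inputs and launch parameters, and compare the two medians.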