I am getting (far) lower runtimes with JIT driver compilation. Is that because the driver is a newer version than the one required by my CUDA toolkit? If the driver exactly matched the CUDA version, would I get the same runtimes?
Generally speaking, I would expect the ptxas functionality in the driver that shipped with a particular toolkit to roughly match the ptxas tool shipped with that toolkit, so JIT vs. offline compilation shouldn't matter much in that scenario. I'm sure there are other factors that could be involved in your observation.
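As a quick sanity check, you can read off both versions directly:

    ptxas --version   # version of the offline assembler shipped with the toolkit
    nvidia-smi        # header line reports the installed driver version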
Based on historical experience, the ptxas in the online (JIT) compiler and the ptxas in the offline compiler are pretty much never in perfect sync. But they should be close, and so should the generated code, as pointed out by Robert Crovella.
It is possible, but fairly unlikely, that ptxas differences between online and offline compilation lead to noticeable performance differences. It is more likely that the root cause is a configuration difference between the two compilation modes. Check the compiler switch settings, e.g. the use of --use_fast_math.
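As one concrete illustration (a minimal sketch; the scale kernel and the sm_86 target are placeholders for your own kernel and your GPU's compute capability):

    // kernel.cu -- toy kernel for comparing offline vs. JIT compilation
    __global__ void scale(float *x, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }
    // Offline path: SASS is embedded, so the toolkit's ptxas runs at build time:
    //   nvcc -gencode arch=compute_86,code=sm_86 -o app_sass kernel.cu
    // JIT path: only PTX is embedded, so the driver's ptxas runs at load time:
    //   nvcc -gencode arch=compute_86,code=compute_86 -o app_ptx kernel.cu
    // Switches such as --use_fast_math shape the PTX that nvcc emits, so they
    // must be passed identically to both builds for a fair comparison.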
I am not using --use_fast_math. With --use_fast_math I get worse runtimes, because other optimizations are forced off.
Driver Version: 460.67
Runtime reduction: 1 → 0.79 (offline compile → JIT)
Ubuntu 20.04, NVIDIA On-Demand mode; the display is driven by the Intel CPU's integrated GPU.
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
I am not sure what you mean. What other optimizations conflict with --use_fast_math in your use case?
In any event, I mentioned --use_fast_math as one example of a compiler switch that is often relevant to performance. Have you checked for any defines that may differ between your online and offline builds? Are there any differences in the kernel launch parameters? Any differences in metrics collected via the CUDA profiler? Have you double-checked the robustness of the performance measurement framework?
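For the last point, a minimal sketch of an event-based timing harness, reusing the toy scale kernel from above; note that the very first launch of a PTX-only build pays the driver's JIT compilation cost, so it must be excluded with a warm-up run:

    // timing_check.cu -- minimal CUDA-event timing sketch (placeholder kernel)
    #include <cstdio>

    __global__ void scale(float *x, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    int main() {
        const int n = 1 << 20;
        const int reps = 100;
        float *d_x;
        cudaMalloc(&d_x, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        // Warm-up launch: absorbs one-time costs, including JIT compilation.
        scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);
        cudaDeviceSynchronize();

        cudaEventRecord(start);
        for (int rep = 0; rep < reps; ++rep)
            scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("average kernel time: %.6f ms\n", ms / reps);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_x);
        return 0;
    }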
Without knowing the code, the compiler switches used, and the target GPU, I can only speculate wildly. I assume you use a controlled experiment, where all hardware and software stays exactly the same, and only the manner of compilation (online vs offline) changes.
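One way to set up such a controlled experiment (a sketch, again with sm_86 as a placeholder) is to build a single fat binary containing both SASS and PTX, then force the JIT path at run time via the documented CUDA_FORCE_PTX_JIT environment variable:

    nvcc -gencode arch=compute_86,code=sm_86 \
         -gencode arch=compute_86,code=compute_86 -o app kernel.cu
    ./app                        # driver loads the embedded SASS (offline path)
    CUDA_FORCE_PTX_JIT=1 ./app   # driver ignores the SASS and JITs the PTX

This keeps the source, switches, and hardware identical, so the only thing that changes is which ptxas performs the final code generation.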