Hi everybody, I have some questions about Just-In-Time (JIT) Compilation.
I have implemented a library based on Expression Templates according to the paper
J.M. Cohen, “Processing Device Arrays with C++ Metaprogramming”, GPU Computing Gems - Jade Edition
It seems to work fairly well. If I compare the computing time of an elementwise matrix operation
with that of a purpose-written CUDA kernel, I get the following results:
matrix size                       1024x1024   2048x2048   4096x4096
time [ms], hand-written kernel    2.05        8.16        57.4
time [ms], library                2.07        8.17        57.4
The library seems to need approximately the same computing time as the hand-written kernel. I'm also using the C++11 keyword auto so that expressions are evaluated only when they are actually needed, according to http://stackoverflow.com/questions/15856122/expression-templates-improving-performance-in-evaluating-expressions. My first question is:
1. What further benefit (in terms of code optimization) would JIT provide to the library? Would JIT introduce any additional overhead due to runtime compilation?
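To make the lazy evaluation with auto concrete, here is a minimal CPU-only sketch of the technique. It is not the actual library code: the names et, Vec, Sum, Operand and demo_first_element are made up for illustration, and the real library evaluates on the GPU rather than in a host loop. Leaves are held by reference and intermediate nodes by value, so that an expression stored in an auto variable does not refer to destroyed temporaries.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace et {  // hypothetical namespace, for illustration only

struct Vec;

// How an operand is stored inside an expression node:
// intermediate nodes by value, Vec leaves by const reference.
template <typename T> struct Operand { using type = T; };
template <> struct Operand<Vec> { using type = const Vec&; };

// Expression node for an elementwise sum; nothing is computed until operator[].
template <typename L, typename R>
struct Sum {
    typename Operand<L>::type lhs;
    typename Operand<R>::type rhs;
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

// Minimal vector; assigning an expression to it runs one fused loop.
struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n, double v = 0.0) : data(n, v) {}
    double operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }

    template <typename E>
    Vec& operator=(const E& expr) {
        assert(expr.size() == data.size());
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = expr[i];
        return *this;
    }
};

// Builds an expression node instead of computing a result (found via ADL).
template <typename L, typename R>
Sum<L, R> operator+(const L& l, const R& r) { return Sum<L, R>{l, r}; }

// Usage: the expression is built first and evaluated only on assignment.
inline double demo_first_element() {
    Vec a(4, 1.0), b(4, 2.0), c(4, 3.0), r(4);
    auto e = a + b + c;   // only builds Sum<Sum<Vec, Vec>, Vec>
    a.data[0] = 10.0;     // still visible, since e has not been evaluated yet
    r = e;                // single loop: r[i] = a[i] + b[i] + c[i]
    return r[0];
}

}  // namespace et
```

The point of the sketch is that the whole expression type is known at compile time, which is why the generated code can match a hand-written kernel, and also why the templates have to be visible to the user's compiler.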
It is known that a library based on Expression Templates cannot be packaged as a .dll, see for example http://social.msdn.microsoft.com/Forums/en-US/vcgeneral/thread/00edbe1d-4906-4d91-b710-825b503787e2. My second question is:
2. Would JIT help in hiding the implementation from a third-party user? If yes, how?
The CUDA SDK includes the ptxjit example, in which the PTX code is not loaded at runtime but defined at compile time. My third question is:
3. How should I implement JIT in my case? Are there examples of JIT using PTX loaded at runtime?
Thank you very much for any help.