CUDA Expression Templates and Just in Time Compilation (JIT)

Hi everybody, I have some questions about Just-In-Time (JIT) Compilation.

I have implemented a library based on Expression Templates according to the paper

J.M. Cohen, “Processing Device Arrays with C++ Metaprogramming”, GPU Computing Gems - Jade Edition

It seems to work fairly well. If I compare the computing time of the elementwise matrix operation

D_D=A_D*B_D-sin(C_D)+3.;

with that of a purposely developed CUDA kernel, I have the following results:

matrix size                      1024x1024   2048x2048   4096x4096
time [ms], hand-written kernel   2.05        8.16        57.4
time [ms], library               2.07        8.17        57.4
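For readers unfamiliar with the technique, here is a minimal host-only sketch of how an expression such as D = A*B - sin(C) + 3. can be fused into a single loop. It is not the code of my library or of Cohen's paper: the namespace et and the names Expr, Array, Scalar, Binary, Sin and Operand are all hypothetical, and a std::vector stands in for a device array.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

namespace et {  // hypothetical namespace, for illustration only

// CRTP base: lets the operators below match only expression types.
template <class E>
struct Expr {
    const E& self() const { return static_cast<const E&>(*this); }
};

struct Array;

// Storage policy: expression nodes are held by value, arrays by reference,
// so an expression kept in an `auto` variable never points at a dead temporary.
template <class E> struct Operand { using type = E; };
template <> struct Operand<Array> { using type = const Array&; };

// Host stand-in for a device array (assumption: the real type wraps device memory).
struct Array : Expr<Array> {
    std::vector<double> data;
    explicit Array(std::size_t n, double v = 0.0) : data(n, v) {}
    double operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }

    // Assigning an expression is the only place where a loop runs.
    template <class E>
    Array& operator=(const Expr<E>& e) {
        const E& expr = e.self();
        assert(expr.size() == data.size());
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = expr[i];
        return *this;
    }
};

// A constant broadcast to every element.
struct Scalar : Expr<Scalar> {
    double v;
    explicit Scalar(double v_) : v(v_) {}
    double operator[](std::size_t) const { return v; }
};

struct Mul { static double apply(double a, double b) { return a * b; } };
struct Sub { static double apply(double a, double b) { return a - b; } };
struct Add { static double apply(double a, double b) { return a + b; } };

// Binary node: evaluates element i on demand, never a whole temporary array.
template <class Op, class L, class R>
struct Binary : Expr<Binary<Op, L, R>> {
    typename Operand<L>::type l;
    typename Operand<R>::type r;
    Binary(const L& l_, const R& r_) : l(l_), r(r_) {}
    double operator[](std::size_t i) const { return Op::apply(l[i], r[i]); }
    std::size_t size() const { return l.size(); }
};

// Unary node for the elementwise sine.
template <class E>
struct Sin : Expr<Sin<E>> {
    typename Operand<E>::type e;
    explicit Sin(const E& e_) : e(e_) {}
    double operator[](std::size_t i) const { return std::sin(e[i]); }
    std::size_t size() const { return e.size(); }
};

template <class L, class R>
Binary<Mul, L, R> operator*(const Expr<L>& a, const Expr<R>& b) {
    return Binary<Mul, L, R>(a.self(), b.self());
}
template <class L, class R>
Binary<Sub, L, R> operator-(const Expr<L>& a, const Expr<R>& b) {
    return Binary<Sub, L, R>(a.self(), b.self());
}
template <class L>
Binary<Add, L, Scalar> operator+(const Expr<L>& a, double s) {
    return Binary<Add, L, Scalar>(a.self(), Scalar(s));
}
template <class E>
Sin<E> sin(const Expr<E>& e) { return Sin<E>(e.self()); }

}  // namespace et
```

With these definitions, D = A * B - sin(C) + 3.0; on et::Array objects builds one nested type at compile time and runs one loop in Array::operator=. Writing auto e = A * B; only creates the node, and the arithmetic happens later, when e is assigned to an array. In the CUDA version, the loop would presumably be replaced by a kernel launch with one thread per element.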

The library seems to need approximately the same computing time as the hand-written kernel. I’m also using the C++11 keyword auto to evaluate expressions only when they are actually needed, according to http://stackoverflow.com/questions/15856122/expression-templates-improving-performance-in-evaluating-expressions. My first question is:

1. What further benefit (in terms of code optimization) would JIT provide to the library? Would JIT introduce any additional overhead due to runtime compilation?

It is known that a library based on Expression Templates cannot be put inside a .dll library, see for example http://social.msdn.microsoft.com/Forums/en-US/vcgeneral/thread/00edbe1d-4906-4d91-b710-825b503787e2. My second question is:

2. Would JIT help in hiding the implementation from a third-party user? If yes, how?

The CUDA SDK includes the ptxjit example, in which the PTX code is not loaded at runtime but defined at compile time. My third question is:

3. How should I implement JIT in my case? Are there examples of JIT using PTX loaded at runtime?

Thank you very much for any help.

Let me try to explain my problem better.

The post "Cuda kernel just-in-time (jit) compilation possible?" at

http://stackoverflow.com/questions/13567123/cuda-kernel-just-in-time-jit-compilation-possible

states that:

cuda code can be compiled to an intermediate format ptx code, which will then be jit-compiled to the actual device architecture machine code at runtime.

My doubt is whether the above can be applied to an Expression Templates library. I know that, due to instantiation problems, CUDA/C++ template code cannot be compiled to PTX as it stands. But perhaps, if I instantiate all the possible combinations of Type/Operators for Unary and Binary Expressions, at least a part of the implementation can be compiled to PTX (and thereby hidden from third-party users), which can in turn be JIT-compiled for the architecture at hand.
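To make the idea of instantiating all the combinations concrete, here is a minimal host-only sketch of explicit template instantiation over a closed set of operators. The namespace lib and the names binary_kernel, unary_kernel, Mul, Sub, Add and SinOp are hypothetical. In a CUDA build the two function templates would be __global__ kernels, which is assumed here rather than shown.

```cpp
#include <cmath>
#include <cstddef>

namespace lib {  // hypothetical namespace, for illustration only

// Operator tags: the closed set the library commits to supporting.
struct Mul { static double apply(double a, double b) { return a * b; } };
struct Sub { static double apply(double a, double b) { return a - b; } };
struct Add { static double apply(double a, double b) { return a + b; } };
struct SinOp { static double apply(double a) { return std::sin(a); } };

// Generic elementwise routines. Plain loops stand in for CUDA kernels.
template <class Op>
void binary_kernel(double* out, const double* a, const double* b, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = Op::apply(a[i], b[i]);
}

template <class Op>
void unary_kernel(double* out, const double* a, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = Op::apply(a[i]);
}

// Explicit instantiation definitions: these force the compiler to emit code
// for each listed combination in this translation unit, even though nothing
// here calls them. The emitted code can then be shipped in compiled form.
template void binary_kernel<Mul>(double*, const double*, const double*, std::size_t);
template void binary_kernel<Sub>(double*, const double*, const double*, std::size_t);
template void binary_kernel<Add>(double*, const double*, const double*, std::size_t);
template void unary_kernel<SinOp>(double*, const double*, std::size_t);

}  // namespace lib
```

The trade-off is that each pre-instantiated routine handles a single operator. An expression such as A*B - sin(C) + 3. would then need one pass per operator with intermediate arrays in between, so the fusion into one kernel that Expression Templates provide is lost for the part shipped this way.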