Calling JIT code from a kernel: is it possible to call just-in-time compiled code from an already loaded kernel?

Hello, I have a large kernel that repeatedly calls another small kernel. I would like to tweak the small kernel with each run, but I do not want to recompile everything each time; I would like to recompile (JIT) just a small piece of code. Is it possible to do just-in-time compilation on just a small piece of code without recompiling all the device code?

I thought I could maybe do this by loading several kernels into one module and then replacing just the one kernel, but this did not work. When I tried loading multiple kernels into one module with cuModuleLoadDataEx(…), it would just overwrite anything that was already there.
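For what it's worth, one way around the overwrite problem might be to keep each kernel in its own module: load the big, unchanging kernel once, and JIT only the small kernel's PTX into a separate module that is unloaded and reloaded on each pass of the outer loop. A rough sketch with the driver API follows; `generate_small_kernel_ptx`, `"kernel1"`, and `"kernel2"` are placeholders for your own PTX generator and kernel names, and error checking is omitted:

```cuda
#include <cuda.h>

// Assumed to exist elsewhere: emits fresh PTX for the small kernel.
extern const char *generate_small_kernel_ptx(void);

void run_loop(CUmodule big_module, int iterations)
{
    // The 200 lines of non-changing code live in a module loaded once.
    CUfunction big_kernel;
    cuModuleGetFunction(&big_kernel, big_module, "kernel1");

    for (int i = 0; i < iterations; i++) {
        // JIT only the ~30 changing lines into their own module.
        CUmodule small_module;
        CUfunction small_kernel;
        const char *ptx = generate_small_kernel_ptx();
        cuModuleLoadDataEx(&small_module, ptx, 0, NULL, NULL);
        cuModuleGetFunction(&small_kernel, small_module, "kernel2");

        /* ... launch big_kernel / small_kernel with cuLaunchKernel ... */

        cuModuleUnload(small_module); // drop it before the next JIT pass
    }
}
```

Since the two kernels live in separate CUmodule handles, unloading and reloading the small one should not disturb the big one.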

10,000-foot view:

LOOP
	CPU: compile PTX for Kernel 2
	LOOP
		Kernel 1: 200 lines of non-changing code
		LOOP
			Kernel 2: 30 lines (JIT code is here)
		END LOOP
		Kernel 1: 200 lines of non-changing code
	END LOOP
END LOOP

Ideas that have not worked:
-It looks like it could be done by exiting in the middle of the kernel, calling the JIT kernel, and then running the last part (over and over). But this would be slow, because I would need to do the work in several hundred kernel calls instead of just one.
-I could skip JIT altogether: for the 8 lines of code I could just use an 800-way switch, but the PTX then has to do many jumps, and this would also be slow.

Thank you for any ideas.

In case anyone is wondering, I just verified my assumption from earlier…

Calling a test kernel 100 times took .59 ms…

...

cudaEventRecord(start, 0);

for (int i = 0; i < 100; i++)
	increment_kernel<<<blocks, threads, 0, 0>>>(d_a, value);

cudaEventRecord(stop, 0);

...

__global__ void increment_kernel(int *g_data, int inc_value)
{
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	g_data[idx] = g_data[idx] + (inc_value * 2) + (999 / (idx + 1)); // some extra work added; idx + 1 avoids dividing by zero in thread 0
}

Calling a kernel once and looping 100 times inside that one kernel (the same amount of work) only took .04 ms…

...

cudaEventRecord(start, 0);

increment_kernel<<<blocks, threads, 0, 0>>>(d_a, value);

cudaEventRecord(stop, 0);

...

__global__ void increment_kernel(int *g_data, int inc_value)
{
	for (int i = 0; i < 100; i++)
	{
		int idx = blockIdx.x * blockDim.x + threadIdx.x;
		g_data[idx] = g_data[idx] + (inc_value * 2) + (999 / (idx + 1)); // some extra work added; idx + 1 avoids dividing by zero in thread 0
	}
}
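One caveat on the timing itself: cudaEventElapsedTime only returns a valid number once the stop event has actually completed, so the elided part after recording the stop event presumably needs to include something like this:

```cuda
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);   // wait until the stop event has really happened
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
```

Without the synchronize, the measured time can come back too small because the kernels are still running.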

It looks like jumping out of one kernel just to execute some JIT code would be slow (if I did my test correctly).
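If the numbers above hold, the difference is almost entirely launch overhead. A quick back-of-the-envelope check (my arithmetic, not from the original measurements):

```python
# Timings from the test above, in milliseconds.
hundred_launches_ms = 0.59  # kernel launched 100 times from the host
single_launch_ms = 0.04     # one launch, loop of 100 inside the kernel

# Per-launch overhead implied by the difference, in microseconds.
per_launch_overhead_us = (hundred_launches_ms - single_launch_ms) / 100 * 1000
print(round(per_launch_overhead_us, 1))  # prints 5.5
```

Roughly 5 to 6 microseconds of overhead per launch, which is in the ballpark usually quoted for CUDA kernel launches, so splitting the work into hundreds of launches per outer iteration would indeed dominate the runtime.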