PTX JIT caching

Unless I'm misreading it, CUDA_FORCE_PTX_JIT is still set to 1 in the third invocation.

(Alternatively: CUDA_FORCE_PTX_JIT actually forces a compile and does not use the JIT cache.)
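
One rough way to tell the two readings apart is to time the same binary twice with CUDA_FORCE_PTX_JIT=1 against a fresh cache directory. The sketch below is only that, a sketch: the helper names (`timed_run`, `compare_runs`) are made up for illustration, and it assumes the binary embeds PTX and that JIT time dominates startup. CUDA_CACHE_DISABLE and CUDA_CACHE_PATH are the documented cache controls; how they interact with CUDA_FORCE_PTX_JIT is exactly the open question here, so the code assumes nothing about the answer.

```python
import os
import subprocess
import time


def timed_run(cmd, extra_env):
    """Run cmd with extra_env layered over the current environment.

    Returns the wall-clock time of the run in seconds.
    """
    env = dict(os.environ)
    env.update(extra_env)
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True)
    return time.perf_counter() - start


def compare_runs(cmd, cache_dir):
    """Time cmd twice with CUDA_FORCE_PTX_JIT=1 and the same cache directory.

    cache_dir should start out empty so the first run cannot hit the cache.
    """
    env = {
        "CUDA_FORCE_PTX_JIT": "1",    # ignore embedded binary code, JIT the PTX
        "CUDA_CACHE_DISABLE": "0",    # leave the JIT cache enabled
        "CUDA_CACHE_PATH": cache_dir, # keep the experiment out of the default cache
    }
    first = timed_run(cmd, env)
    second = timed_run(cmd, env)
    return first, second
```

For example, `compare_runs(["./my_app"], "/tmp/jit_cache_test")` (both the binary and the path are placeholders) returns the two timings. If the second run is much faster, the cache was presumably hit even with the flag set; if both runs take about the same time, that is consistent with a forced compile on every launch. Timing noise can blur either result, so it is worth repeating a few times.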