The Fermi compatibility guide shows how to use the CUDA_FORCE_PTX_JIT environment variable to force JIT compilation of the PTX code. It says that the cubin is cached by the driver and that the cache persists even across reboots. However, when I try this with the SDK examples, nothing seems to be cached at all:
[codebox][plegresl@bigbird release]$ export CUDA_FORCE_PTX_JIT=0
[plegresl@bigbird release]$ time ./simpleCUBLAS -noprompt
simpleCUBLAS test running…
PASSED
real 0m0.260s
user 0m0.170s
sys 0m0.087s
[plegresl@bigbird release]$ export CUDA_FORCE_PTX_JIT=1
[plegresl@bigbird release]$ time ./simpleCUBLAS -noprompt
simpleCUBLAS test running…
PASSED
real 1m13.848s
user 1m13.005s
sys 0m0.833s
[plegresl@bigbird release]$ time ./simpleCUBLAS -noprompt
simpleCUBLAS test running…
PASSED
real 1m13.830s
user 1m12.981s
sys 0m0.837s
[/codebox]
Is this the expected behavior? If the cache were working properly, I would expect the third invocation (the second run with CUDA_FORCE_PTX_JIT=1) to be about as fast as the first.
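In case anyone wants to repeat the comparison without typing the commands by hand, here is a minimal sketch of the same experiment in Python. The helper names `time_run` and `compare` are my own (hypothetical, not part of the CUDA toolkit or SDK); the script only sets CUDA_FORCE_PTX_JIT in the child process environment and measures wall-clock time, so it says nothing about where or whether the driver caches anything.

```python
import os
import subprocess
import time


def time_run(cmd, force_ptx_jit):
    """Run cmd once with CUDA_FORCE_PTX_JIT set to 1 or 0; return wall-clock seconds."""
    env = dict(os.environ)
    # Mirrors the shell session: "export CUDA_FORCE_PTX_JIT=1" or "=0".
    env["CUDA_FORCE_PTX_JIT"] = "1" if force_ptx_jit else "0"
    start = time.perf_counter()
    # check=True raises CalledProcessError if the program exits with a non-zero status.
    subprocess.run(cmd, env=env, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


def compare(cmd, jit_runs=2):
    """Time one baseline run, then jit_runs consecutive runs with forced JIT."""
    baseline = time_run(cmd, force_ptx_jit=False)
    forced = [time_run(cmd, force_ptx_jit=True) for _ in range(jit_runs)]
    return baseline, forced
```

Calling `compare(["./simpleCUBLAS", "-noprompt"])` from the release directory returns the baseline time and a list of the forced-JIT times. If the compiled code were being reused between runs, I would expect the second entry in that list to be much smaller than the first.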