Bypassing cache in Fermi

A couple of compiler intrinsics would solve this problem at the CUDA C level:

template <typename T> T __load(T *address, LOAD_OPTIONS options);
template <typename T> void __store(T *address, T value, STORE_OPTIONS options);
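Until something like that exists, a similar effect can be approximated with inline PTX. The following is only a minimal sketch, not the proposed intrinsics themselves: the names `load_cg`, `store_cg`, `copy_bypass_l1` and `run_copy` are made up for illustration, only the `float` case is covered, and it assumes a 64-bit build with pointers that refer to global memory (e.g. from `cudaMalloc`). The `.cg` cache operator asks for caching in L2 only, so the access skips L1.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: load a float with the .cg operator (cache in L2, not L1).
// Assumes a 64-bit build ("l" constraint) and an address in global memory.
__device__ __forceinline__ float load_cg(const float *address)
{
    float value;
    asm volatile("ld.global.cg.f32 %0, [%1];"
                 : "=f"(value)
                 : "l"(address)
                 : "memory");
    return value;
}

// Hypothetical helper: store a float with the .cg operator (cache in L2, not L1).
__device__ __forceinline__ void store_cg(float *address, float value)
{
    asm volatile("st.global.cg.f32 [%0], %1;"
                 :
                 : "l"(address), "f"(value)
                 : "memory");
}

// Copies in[i] to out[i] using the L1-bypassing helpers above.
__global__ void copy_bypass_l1(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        store_cg(out + i, load_cg(in + i));
    }
}

// Host-side check: returns true if the copy round-trips n floats correctly.
bool run_copy(int n)
{
    std::vector<float> host_in(n), host_out(n, 0.0f);
    for (int i = 0; i < n; ++i) host_in[i] = static_cast<float>(i);

    float *dev_in = nullptr, *dev_out = nullptr;
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    if (cudaMalloc(&dev_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dev_out, bytes) != cudaSuccess) { cudaFree(dev_in); return false; }

    cudaMemcpy(dev_in, host_in.data(), bytes, cudaMemcpyHostToDevice);
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    copy_bypass_l1<<<blocks, threads>>>(dev_in, dev_out, n);
    bool ok = cudaMemcpy(host_out.data(), dev_out, bytes,
                         cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(dev_in);
    cudaFree(dev_out);
    for (int i = 0; ok && i < n; ++i) ok = (host_out[i] == host_in[i]);
    return ok;
}
```

If per-access control is not needed, passing -Xptxas -dlcm=cg to nvcc should make all global loads in the compilation unit bypass L1 on Fermi, at the cost of losing the fine-grained choice the intrinsics above would give.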