On Fermi GPUs, global memory accesses use caching loads by default (i.e., a granularity of 128 bytes). With CUDA, you can switch to non-caching loads by compiling with nvcc and “-Xptxas -dlcm=cg”.
With PGI’s OpenACC, I assume caching loads are also the default. Right? Is there any way to use non-caching loads with OpenACC (compiler flag, environment variable, …)?
We do have an experimental flag (-Mx,180,8) that will disable the L1 cache. You are welcome to give it a try. The caveat being that since it’s not been exposed at the user level, it is subject to change.
Thanks Mat! I will give it a try and will report my results.
Apologies for resurrecting this thread. Since on the K40 we can once again use caching loads and dlcm=ca, I was wondering how I could enable this in the CUDA Fortran compiler. Could you help me with that, please?
We added this as the flag “-ta=tesla:noL1”.
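For anyone who wants to compare the two settings, here is a minimal sketch of an OpenACC kernel to test with. It is only an illustration: the names saxpy and saxpy_mismatches are made up for this example, and whether disabling L1 helps will depend on the access pattern and the GPU. Each element is read exactly once here, so it is the kind of streaming loop where the cache setting is worth measuring.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] = a * x[i] + y[i]. Every element is loaded once, so caching
   global loads in L1 is unlikely to give any reuse in this loop. */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    assert(n >= 0);
    /* A compiler without OpenACC support ignores this pragma and the
       loop simply runs on the host. */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

/* Hypothetical self-check: runs saxpy on known inputs and returns the
   number of elements that differ from the expected value. */
int saxpy_mismatches(void)
{
    enum { N = 1024 };
    static float x[N], y[N];
    for (int i = 0; i < N; ++i) {
        x[i] = (float)i;
        y[i] = 1.0f;
    }
    saxpy(N, 2.0f, x, y);

    int bad = 0;
    for (int i = 0; i < N; ++i) {
        /* 2*i + 1 is exactly representable in float for i < 1024. */
        if (y[i] != 2.0f * (float)i + 1.0f) {
            ++bad;
        }
    }
    return bad;
}
```

Assuming the file is saved as saxpy.c, building it once with "pgcc -acc -c saxpy.c" and once with "pgcc -acc -ta=tesla:noL1 -c saxpy.c" (or with the experimental "-Mx,180,8" mentioned earlier) should give the two variants to time against each other from a small driver. The exact flag spelling may depend on the compiler version, so please check the documentation for your release.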