Cache behavior when loading global data to shared memory in Fermi

If I load data from global memory into shared memory, do those loads also go through L1? (Assume the configuration ensures that global loads do not bypass L1.)

There is no “load from global to shared” operation. It’s broken down into two steps: a load from global memory into a register, then a store from that register into shared memory. So your question is really “does a load from global memory get cached in L1?” On Fermi the answer is yes by default, but you can change this in several ways. If you use the inline PTX instruction “ld.global.cg”, the read will be cached in L2 but not L1. “ld.global.cs” marks the read as streaming: according to the PTX ISA, the line is still allocated in L1 and L2, but with an evict-first policy so that data touched only once or twice does not pollute the cache. (The operator that treats cached lines as stale and fetches again is “ld.global.cv”.) Alternatively, you can use an NVCC flag to make either of these the default, with “-Xptxas -dlcm=cg” or “-Xptxas -dlcm=cs”, though of course that then applies to ALL global loads in the code compiled with that flag.
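To make the two-step picture concrete, here is a minimal sketch of a kernel that stages data into shared memory through a register, using an inline PTX “ld.global.cg” load so the read skips L1. The names load_cg, stage_to_shared, run_demo and kBlock are made up for this illustration. It assumes a 64-bit build (the pointer is passed with the "l" constraint) and relies on the pointer actually referring to global memory; it is not meant as the one right way to do this.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kBlock = 128;  // threads per block (illustrative choice)

// Hypothetical helper: load one float with the .cg cache operator
// (cached in L2, not in L1). Assumes ptr points to global memory and
// that pointers are 64-bit.
__device__ __forceinline__ float load_cg(const float* ptr) {
    float value;
    asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(value) : "l"(ptr));
    return value;
}

// Each block copies kBlock elements into shared memory, then writes them
// back out in reversed order within the block.
__global__ void stage_to_shared(const float* in, float* out, int n) {
    __shared__ float tile[kBlock];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Step 1: global -> register (this is the load the cache operator affects).
    float r = (i < n) ? load_cg(in + i) : 0.0f;
    // Step 2: register -> shared.
    tile[threadIdx.x] = r;
    __syncthreads();

    if (i < n) out[i] = tile[blockDim.x - 1 - threadIdx.x];
}

// Returns 0 on success, 1 on a wrong result, 2 on a CUDA error.
int run_demo() {
    const int n = 8 * kBlock;  // multiple of kBlock, so every tile is full
    std::vector<float> h_in(n), h_out(n, -1.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return 2;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) { cudaFree(d_in); return 2; }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    stage_to_shared<<<n / kBlock, kBlock>>>(d_in, d_out, n);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return 2;

    // Element i should hold the mirrored element of the same block.
    for (int i = 0; i < n; ++i) {
        int block = i / kBlock, t = i % kBlock;
        float expected = static_cast<float>(block * kBlock + (kBlock - 1 - t));
        if (h_out[i] != expected) return 1;
    }
    return 0;
}

int main() {
    int status = run_demo();
    std::printf("run_demo status: %d\n", status);
    return status;
}
```

Replacing load_cg(in + i) with a plain in[i] and compiling with “-Xptxas -dlcm=cg” should give the same caching behavior for that load, but the flag also changes every other global load in the file, whereas the inline PTX version affects only the loads you pick.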