I have rewritten a kernel that previously read its data through texture memory so that it now uses ordinary global loads cached in L1 on Fermi cards, because as far as I know the L1 cache is faster than the texture cache. Strangely, using the L1 cache (configured to 48 KB and compiled with -dlcm=ca) gives no performance boost over caching in L2 only (-dlcm=cg). Quite the contrary: the L2-only build is even slightly faster than the L1 build.
Any ideas why using the L1 cache for global memory accesses with lots of locality is slower than using only the L2 cache? The kernel does not use any local memory that would eat into the L1 cache (according to --ptxas-options=-v).
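For reference, here is a minimal sketch of the kind of setup described above, not the actual kernel. The kernel name `stencilKernel`, its 3-point access pattern, the problem size, and the `CHECK` macro are all hypothetical stand-ins; only `cudaFuncSetCacheConfig` with `cudaFuncCachePreferL1` and the `-dlcm` ptxas options correspond to what the question describes. Building the same file twice, once with each `-dlcm` setting, and comparing the printed times should reproduce the comparison.

```cuda
// Build twice and compare the timings (sm_20 targets Fermi and needs a toolkit that still supports it):
//   nvcc -arch=sm_20 -Xptxas -dlcm=ca -Xptxas -v l1_test.cu -o l1_ca   (global loads cached in L1 and L2)
//   nvcc -arch=sm_20 -Xptxas -dlcm=cg -Xptxas -v l1_test.cu -o l1_cg   (global loads cached in L2 only)
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                      \
    do {                                                                 \
        cudaError_t err_ = (call);                                       \
        if (err_ != cudaSuccess) {                                       \
            fprintf(stderr, "CUDA error: %s (line %d)\n",                \
                    cudaGetErrorString(err_), __LINE__);                 \
            exit(EXIT_FAILURE);                                          \
        }                                                                \
    } while (0)

// Hypothetical kernel: each thread reads its own element and both neighbours
// through plain global loads, so adjacent threads touch overlapping data.
__global__ void stencilKernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int left  = (i > 0)     ? i - 1 : i;
        int right = (i < n - 1) ? i + 1 : i;
        out[i] = in[left] + in[i] + in[right];
    }
}

int main()
{
    const int n = 1 << 20;              // assumed problem size
    const size_t bytes = n * sizeof(float);
    float *in = NULL, *out = NULL;
    CHECK(cudaMalloc((void**)&in, bytes));
    CHECK(cudaMalloc((void**)&out, bytes));
    CHECK(cudaMemset(in, 0, bytes));

    // Ask for the 48 KB L1 / 16 KB shared memory split for this kernel.
    // This is a preference; the runtime may choose a different split.
    CHECK(cudaFuncSetCacheConfig(stencilKernel, cudaFuncCachePreferL1));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Warm-up launch so one-time startup cost is not included in the timing.
    stencilKernel<<<blocks, threads>>>(in, out, n);
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaEventRecord(start));
    stencilKernel<<<blocks, threads>>>(in, out, n);
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaGetLastError());

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    printf("kernel time: %f ms\n", ms);

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(in));
    CHECK(cudaFree(out));
    return 0;
}
```

A single timed launch is noisy, so averaging over many launches would give a fairer comparison between the two builds.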