Fermi cache performance: L1 vs. L2 cache

I have rewritten a kernel that previously used texture memory so that it now reads through the L1 cache on Fermi cards, because as far as I know the L1 cache is faster than the texture cache. What is strange is that using the L1 cache (configured to 48 KB and compiled with -Xptxas -dlcm=ca) does not give a performance boost over using only the L2 cache (-Xptxas -dlcm=cg). Quite the contrary: the L2-only build is even slightly faster than the L1 build.

Any ideas why using the L1 cache for global memory accesses with lots of locality is slower than using only the L2 cache? The kernel does not use any local memory that would eat up the L1 cache (according to --ptxas-options=-v).
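
For anyone who wants to reproduce the comparison, here is a minimal sketch of the setup described above. The real kernel is not shown in this thread, so stencil3 and the file name l1_test.cu are hypothetical stand-ins; only cudaFuncSetCacheConfig and the ptxas flags correspond to what the post mentions. Note that -arch=sm_20 needs an older toolkit that still supports Fermi.

```cuda
// Build (Fermi, compute capability 2.x); l1_test.cu is a hypothetical file name:
//   nvcc -arch=sm_20 -Xptxas -dlcm=ca -Xptxas -v l1_test.cu   // global loads cached in L1 and L2
//   nvcc -arch=sm_20 -Xptxas -dlcm=cg -Xptxas -v l1_test.cu   // global loads cached in L2 only
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: neighbouring threads re-read
// the same elements, so the global loads have a lot of locality.
__global__ void stencil3(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = in[i - 1] + in[i] + in[i + 1];
}

int main()
{
    const int n = 1 << 22;
    float *in = 0, *out = 0;
    cudaMalloc((void**)&in,  n * sizeof(float));
    cudaMalloc((void**)&out, n * sizeof(float));
    cudaMemset(in,  0, n * sizeof(float));
    cudaMemset(out, 0, n * sizeof(float));

    // Request the 48 KB L1 / 16 KB shared memory split for this kernel.
    cudaFuncSetCacheConfig(stencil3, cudaFuncCachePreferL1);

    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    stencil3<<<blocks, threads>>>(in, out, n);   // warm-up launch, not timed

    cudaEventRecord(start);
    stencil3<<<blocks, threads>>>(in, out, n);   // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Building the same file once with each -dlcm setting and comparing the printed times gives the L1 vs. L2-only comparison; the -Xptxas -v output shows whether any local memory is used.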

Thanks!

Btw, what is the impact compared to using the texture cache?

It is actually slower than the version with the texture cache. I think this is because I now have to calculate the array index explicitly, which adds a few instructions in a critical code section. Previously the texture fetch did this for me. I thought the kernel was memory-bound and that these additional instructions would not matter much… Well, I guess they matter more than the gain from using L1 instead of the texture cache.
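
To illustrate the kind of extra work meant here, a rough sketch follows. It assumes a 2D float image with clamp-to-edge addressing, which is only a guess since the actual kernel is not posted; load_clamped, blur3 and all parameters are hypothetical names. With a texture, each neighbour read would be a single fetch such as tex2D(...), and the addressing and clamping would be done by the texture unit.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: the index arithmetic a clamped 2D texture fetch
// would otherwise perform in hardware, now spent as ALU instructions.
__device__ __forceinline__ float load_clamped(const float* img, int x, int y,
                                              int width, int height, int pitch_elems)
{
    x = min(max(x, 0), width  - 1);   // clamp column to the image
    y = min(max(y, 0), height - 1);   // clamp row to the image
    return img[y * pitch_elems + x];  // explicit row-major index
}

// Hypothetical kernel: 3x3 box filter reading through global memory (L1/L2).
__global__ void blur3(const float* img, float* out,
                      int width, int height, int pitch_elems)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            sum += load_clamped(img, x + dx, y + dy, width, height, pitch_elems);

    out[y * pitch_elems + x] = sum / 9.0f;
}

int main()
{
    const int width = 64, height = 64, pitch_elems = 64;
    float *img = 0, *out = 0;
    cudaMalloc((void**)&img, pitch_elems * height * sizeof(float));
    cudaMalloc((void**)&out, pitch_elems * height * sizeof(float));
    cudaMemset(img, 0, pitch_elems * height * sizeof(float));

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    blur3<<<grid, block>>>(img, out, width, height, pitch_elems);
    cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(img);
    cudaFree(out);
    return 0;
}
```

Each of the nine reads now costs two clamps plus a multiply-add for the index, which is the sort of overhead that can show up in an inner loop even when the kernel looks memory-bound.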

I am still curious about the L1/L2 performance though.

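That is interesting. You probably have high occupancy, in which case the memory latency is already hidden well and the extra cache level makes little difference.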