Texture Reads: What is the source of the performance increase?

I am trying to understand the source of performance increase when using textures on Fermi.
I am speculating here.
I would appreciate it if someone could confirm or refute my suspicions.

According to the documentation there should be no direct performance increase.
However, also according to the documentation, texture reads bypass L1.

So, correct me if I am wrong.
If I load into shared memory without using textures,
the data is read into a register (through L1) and then stored to shared memory.
That makes no sense at all, because all it accomplishes is polluting L1.

I can disable caching in L1 through a compiler flag, but that will also disable L1 caching for local variables, which I want cached in L1.
So, in other words, I want caching for local variables in L1 (so I should not disable L1 caching),
but I don’t want L1 caching for my “actual data”, so I should declare it as a texture (if it is read-only).

Did I get it right?

Most often the reason is that non-texture reads have a cache line size of 128 bytes, as opposed to 32 bytes for texture reads.
So if you randomly access structures of 32 bytes or less (a float4, for example, is 16 bytes), you only need to load 1/4 as much data into the texture cache as you would need to read into the L2/L1 cache. While the peak bandwidth of the L2 cache is higher than that of the texture cache, textures are still far better for random access.
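
To make the 1/4 figure concrete, here is the arithmetic for a single random float4 read (16 bytes). It takes the 128-byte and 32-byte line sizes quoted above as given, and assumes, as a simplification not stated in the post, that every access misses and touches exactly one cache line:

```latex
\begin{align*}
\text{bytes fetched (L1 path)} &= 128, & \text{useful fraction} &= \tfrac{16}{128} = 12.5\% \\
\text{bytes fetched (texture path)} &= 32, & \text{useful fraction} &= \tfrac{16}{32} = 50\% \\
\frac{\text{texture traffic}}{\text{L1 traffic}} &= \frac{32}{128} = \frac{1}{4}
\end{align*}
```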

Ceearem

P.S. google is your friend: “cuda fermi texture L2” gives the following two posts in these forums:

Okay, nothing random about my access.

Always fetching in chunks of 128 bytes.

What about loading shared memory?

The transfer is device_memory → registers → shared_memory, right?

So with L1 on, the transfer is device_memory → L2 → L1 → registers → shared_memory, right?

So, if the access is “read only”, it only makes sense to use textures for the data, right?
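
For reference, the two load paths being compared can be sketched roughly as below. This is only an illustration, not a benchmark. All kernel and variable names are made up. It uses the legacy texture-reference API (the texture path available on Fermi), which was deprecated in CUDA 11 and removed in CUDA 12, so it needs an older toolkit. In both kernels the value still passes through a register before it is stored to shared memory. What differs is which cache the load goes through.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 128  // threads per block; each thread stages one float (hypothetical size)

// File-scope texture reference; bound to linear device memory in main().
texture<float, 1, cudaReadModeElementType> texIn;

// Path A: ordinary global load (cached in L1 by default on Fermi),
// then a store to shared memory.
__global__ void stageViaGlobal(const float* in, float* out, int n)
{
    __shared__ float tile[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;   // global -> register -> shared
    __syncthreads();
    int j = (threadIdx.x + 1) % blockDim.x;       // read a neighbour's element
    if (i < n) out[i] = tile[j];
}

// Path B: same staging, but the load goes through the texture cache.
__global__ void stageViaTexture(float* out, int n)
{
    __shared__ float tile[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? tex1Dfetch(texIn, i) : 0.0f;  // texture -> register -> shared
    __syncthreads();
    int j = (threadIdx.x + 1) % blockDim.x;
    if (i < n) out[i] = tile[j];
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *dIn, *dOut;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemset(dIn, 0, bytes);

    const int blocks = (n + TILE - 1) / TILE;
    stageViaGlobal<<<blocks, TILE>>>(dIn, dOut, n);

    cudaBindTexture(NULL, texIn, dIn, bytes);     // bind the same buffer as a 1D texture
    stageViaTexture<<<blocks, TILE>>>(dOut, n);
    cudaUnbindTexture(texIn);

    cudaError_t err = cudaDeviceSynchronize();
    printf("%s\n", cudaGetErrorString(err));

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```

Timing the two kernels (for example with cudaEvent timers or a profiler) on the actual access pattern is the only reliable way to tell whether the texture path helps in a given case.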