Why is texture memory better on Fermi?

I see the following lines in the Fermi tuning guide:

But for my program, I still get a speedup from using texture memory on a GTX 480, even when I configure the L1 cache to be 48 KB. I'm confused by this, because I don't use any of the other benefits texture memory provides, and the L1 cache is also much larger than the texture cache. I want to know why. One possible reason is that the texture cache serves texture fetches exclusively, while the L1 cache is shared by all memory accesses. Is that reasonable?
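
For reference, here is a minimal sketch of the kind of comparison I mean (kernel and variable names are just illustrative, not my actual code): the same read done once as a plain global load with L1 preferred at 48 KB, and once through a texture fetch using the Fermi-era texture-reference API.

```
#include <cuda_runtime.h>

// Fermi-era texture reference bound to the input array.
texture<float, 1, cudaReadModeElementType> texIn;

__global__ void readGlobal(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];                 // served through L1/L2
}

__global__ void readTexture(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tex1Dfetch(texIn, i);  // served through the texture cache
}

void run(const float *d_in, float *d_out, int n)
{
    // Ask for the 48 KB L1 / 16 KB shared configuration for the global-load kernel.
    cudaFuncSetCacheConfig(readGlobal, cudaFuncCachePreferL1);
    readGlobal<<<(n + 255) / 256, 256>>>(d_in, d_out, n);

    cudaBindTexture(0, texIn, d_in, n * sizeof(float));
    readTexture<<<(n + 255) / 256, 256>>>(d_out, n);
    cudaUnbindTexture(texIn);
}
```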

From what David Kirk said at VSCSE, yes, all memory accesses go through L1, so you can expect its entire contents to be evicted after about 10 cycles. So texture accesses can still be better when you need the data to stick around for longer. You should look at it as an extension of the broadcast feature in previous generations. The scheduler is probably smart enough to execute the memory reads from multiple warps one after the other, so when many threads access the same memory location, instead of the previous ~16x broadcast speedup you might see up to ~160x speedup (just a guess; let me know what numbers you get if you try this :) ).
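
Just to be concrete about the "many threads access the same memory location" case, here is a hypothetical kernel (illustrative only) where the index into a small coefficient table is uniform across the whole warp, so every thread requests the same 4-byte word:

```
__global__ void uniformRead(const float *coeffs, float *out, int k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = coeffs[k] * i;   // k is the same for every thread,
                                         // so the whole warp reads one address
}
```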

The Fermi L1 cache line size is 128 bytes. If the threads inside a warp access different cache lines, then even if all of those lines are resident in L1, you get 1/32 of the theoretical L1 bandwidth, because each thread issues one 128-byte transaction just to get a 4-byte float value.

The texture cache and shared memory don't have this problem.
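
To make that concrete, here is a minimal sketch (illustrative names) of the worst case described above: with a stride of 32 floats, each of the 32 threads in a warp falls into a different 128-byte line, so one warp load turns into 32 line-sized transactions for only 128 useful bytes.

```
__global__ void stridedRead(const float *in, float *out, int stride, int n)
{
    // With stride == 32 (floats), consecutive threads are 128 bytes apart,
    // i.e. each thread in the warp touches a different 128-byte cache line.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i * stride];
}
```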

What do you mean by "the texture cache and shared memory don't have this problem"?

Does that mean the texture cache isn't organized into cache lines like the L1 cache, but works more like shared memory, where a 32-bit word can be accessed on its own?

The texture cache line size is 32 bytes, so float4 values can be accessed on their own. For a single float the penalty is 1/4, which is much better than for the L1 cache.
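
As a sketch of what "accessed on their own" looks like (hypothetical kernel, again using the Fermi-era texture-reference API): a float4 fetch is 16 bytes, i.e. half of a 32-byte texture cache line, whereas a lone float uses only 4 of the 128 bytes of an L1 line.

```
texture<float4, 1, cudaReadModeElementType> texVec;

__global__ void readVec4(float4 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tex1Dfetch(texVec, i);  // 16 of the 32 bytes in the
                                                // texture cache line are used
}
```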

That’s incorrect.

The GF100 L1 cache is literally the same as shared memory. You can access 32 completely different loaded L1 cache lines at once with no serialization. You can, however, get bank conflicts when reading from L1, which cause serialization if two threads' reads map to the same bank.

The L2 cache is very different. L2 is only read in serialized 128-byte cache lines, and that is probably what you're thinking of.

Strangely enough, the L1 cache can't combine requests to different cache lines into one request the way shared memory can.

CUDA Programming Guide:

Shared memory doesn’t have cache lines at all.

There's a big difference between cache lines and banks. Cache lines are groups of 128 contiguous bytes aligned to 128-byte boundaries. GF100 copies values from device memory to L2, and from L2 to L1, in these 128-byte chunks.

Banks are unrelated to cache lines. GF100 shared memory/L1 has 32 banks, each corresponding to the low-order word address bits. On each instruction tick, each bank can deliver its 4-byte word to one thread that requests a word with that low-order address. If two threads request different memory locations that share the same low-order word address, the bank can only service one of them, and another instruction tick is needed to service the next; they're serialized. (There's an exception for a broadcast when the threads read the identical address, though.)

Your quote from the programming manual discusses a different topic: multi-word accesses by threads. L1 behaves just like shared memory in this case, with identical, unavoidable bank conflicts.
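
To make the bank picture concrete, here is the classic illustration (a hypothetical 32x32 tile kernel, not from this thread): reading down a column of a [32][32] shared-memory array puts all 32 threads of a warp in the same bank, a 32-way conflict, while padding each row to 33 words spreads the column across all 32 banks.

```
// Launch with a 32x32 thread block.
__global__ void transposeTile(const float *in, float *out)
{
    __shared__ float tile[32][33];   // row padded from 32 to 33 words

    int x = threadIdx.x, y = threadIdx.y;
    tile[y][x] = in[y * 32 + x];     // row write: consecutive banks, no conflict
    __syncthreads();

    // Column read: with tile[32][32] these addresses would be 32 words apart,
    // i.e. all in the same bank (a 32-way conflict). The padding word shifts
    // each row by one bank, so the column read is conflict-free.
    out[x * 32 + y] = tile[x][y];
}
```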
