Why is texture memory better on Fermi?

I see the following lines in the Fermi tuning guide:

But for my program, I still get a speedup from using texture memory on a GTX 480, even when I configure the L1 cache to be 48 KB. I'm confused by this, because I don't use any of the other benefits texture memory provides, and the L1 cache is also much larger than the texture cache. I want to know why. One possible reason is that the texture cache serves texture fetches exclusively, while the L1 cache is shared by all memory accesses. Is that reasonable?
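
For reference, here is a minimal sketch of the kind of comparison I mean (kernel and variable names are just illustrative, not my actual code): the same read done once as a plain global load with L1 preferred at 48 KB, and once through a texture fetch using the Fermi-era texture-reference API.

```
#include <cuda_runtime.h>

// Fermi-era texture reference bound to the input array.
texture<float, 1, cudaReadModeElementType> texIn;

__global__ void readGlobal(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];                 // served through L1/L2
}

__global__ void readTexture(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tex1Dfetch(texIn, i);  // served through the texture cache
}

void run(const float *d_in, float *d_out, int n)
{
    // Ask for the 48 KB L1 / 16 KB shared configuration for the global-load kernel.
    cudaFuncSetCacheConfig(readGlobal, cudaFuncCachePreferL1);
    readGlobal<<<(n + 255) / 256, 256>>>(d_in, d_out, n);

    cudaBindTexture(0, texIn, d_in, n * sizeof(float));
    readTexture<<<(n + 255) / 256, 256>>>(d_out, n);
    cudaUnbindTexture(texIn);
}
```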

From what David Kirk said at VSCSE, yes, all memory accesses go through L1, so you can expect its entire contents to be evicted after about 10 cycles. So texture accesses can still be better when you need the data to stick around for longer. You should look at it as an extension of the broadcast feature in previous generations. The scheduler is probably smart enough to execute the memory reads from multiple warps one after the other, so when many threads access the same memory location, instead of the previous ~16x broadcast speedup you might see up to ~160x speedup (just a guess; let me know what numbers you get if you try this :) ).
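
Just to be concrete about the "many threads access the same memory location" case, here is a hypothetical kernel (illustrative only) where the index into a small coefficient table is uniform across the whole warp, so every thread requests the same 4-byte word:

```
__global__ void uniformRead(const float *coeffs, float *out, int k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = coeffs[k] * i;   // k is the same for every thread,
                                         // so the whole warp reads one address
}
```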

The Fermi L1 cache line size is 128 bytes. If the threads inside a warp access different cache lines, then even if all of those lines are resident in L1, you get 1/32 of the theoretical L1 bandwidth, because each thread issues one 128-byte transaction just to get a 4-byte float value.

The texture cache and shared memory don't have this problem.
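
To make that concrete, here is a minimal sketch (illustrative names) of the worst case described above: with a stride of 32 floats, each of the 32 threads in a warp falls into a different 128-byte line, so one warp load turns into 32 line-sized transactions for only 128 useful bytes.

```
__global__ void stridedRead(const float *in, float *out, int stride, int n)
{
    // With stride == 32 (floats), consecutive threads are 128 bytes apart,
    // i.e. each thread in the warp touches a different 128-byte cache line.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i * stride];
}
```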

What do you mean by "the texture cache and shared memory don't have this problem"?

Does that mean the texture cache isn't organized into cache lines like the L1 cache, but works more like shared memory, where a 32-bit word can be accessed on its own?

The texture cache line size is 32 bytes, so float4 values can be accessed on their own. For a single float the penalty is 1/4, which is much better than for the L1 cache.
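
As a sketch of what "accessed on their own" looks like (hypothetical kernel, again using the Fermi-era texture-reference API): a float4 fetch is 16 bytes, i.e. half of a 32-byte texture cache line, whereas a lone float uses only 4 of the 128 bytes of an L1 line.

```
texture<float4, 1, cudaReadModeElementType> texVec;

__global__ void readVec4(float4 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tex1Dfetch(texVec, i);  // 16 of the 32 bytes in the
                                                // texture cache line are used
}
```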

That’s incorrect.

The GF100 L1 cache is literally the same as shared memory. You can access 32 completely different loaded L1 cache lines at once with no serialization. You can, however, get bank conflicts when reading from L1, which cause serialization if two threads' reads map to the same bank.

The L2 cache is very different. L2 is only read in serialized 128-byte cache lines, and that is probably what you're thinking of.

Strangely enough, the L1 cache can't combine requests to different cache lines into one request the way shared memory can.

CUDA Programming Guide:

Shared memory doesn’t have cache lines at all.

There's a big difference between cache lines and banks. Cache lines are groups of 128 contiguous bytes aligned to 128-byte boundaries. GF100 copies values from device memory to L2, and from L2 to L1, in these 128-byte chunks.

Banks are unrelated to cache lines. GF100 shared memory/L1 has 32 banks, each corresponding to the low-order word address bits. On each instruction tick, each bank can deliver its 4-byte word to one thread that requests a word with that low-order address. If two threads request different memory locations that share the same low-order word address, the bank can only service one of them, and another instruction tick is needed to service the next; they're serialized. (There's an exception for a broadcast when the threads read the identical address, though.)

Your quote from the programming manual discusses a different topic: multi-word accesses by threads. L1 behaves just like shared memory in this case, with identical, unavoidable bank conflicts.
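
To make the bank picture concrete, here is the classic illustration (a hypothetical 32x32 tile kernel, not from this thread): reading down a column of a [32][32] shared-memory array puts all 32 threads of a warp in the same bank, a 32-way conflict, while padding each row to 33 words spreads the column across all 32 banks.

```
// Launch with a 32x32 thread block.
__global__ void transposeTile(const float *in, float *out)
{
    __shared__ float tile[32][33];   // row padded from 32 to 33 words

    int x = threadIdx.x, y = threadIdx.y;
    tile[y][x] = in[y * 32 + x];     // row write: consecutive banks, no conflict
    __syncthreads();

    // Column read: with tile[32][32] these addresses would be 32 words apart,
    // i.e. all in the same bank (a 32-way conflict). The padding word shifts
    // each row by one bank, so the column read is conflict-free.
    out[x * 32 + y] = tile[x][y];
}
```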
