Why is texture memory better on Fermi?

Shared memory doesn’t have cache lines at all.

There’s a big difference between cache lines and banks. Cache lines are groups of 128 contiguous bytes aligned to 128-byte boundaries. GF100 copies values from device memory to L2, and from L2 to L1, in these 128-byte chunks.

Banks are unrelated to cache lines. GF100 shared/L1 has 32 banks, each corresponding to the low-order word bits of the address. At each instruction tick, each bank can output one 4-byte word to a thread that requests a word at that low-order address. If two threads request different words that map to the same bank, the bank can only service one of them, and another instruction tick is needed to service the next… they’re serialized. (There’s an exception for a broadcast of a single identical word, though.)
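
To make the banking rule concrete, here is a minimal sketch (hypothetical code, not from any attached program; the kernel name and sizes are just illustrative) contrasting a conflict-free warp access with a 32-way conflicted one on GF100’s 32-bank shared memory:

[codebox]// Illustration of Fermi shared-memory banking: bank = (word index) % 32.
__global__ void bankDemo(float *out)
{
    __shared__ float buf[32 * 32];

    // Fill the shared array.
    for (int i = threadIdx.x; i < 32 * 32; i += blockDim.x)
        buf[i] = (float)i;
    __syncthreads();

    // Conflict-free: thread n reads word n, which lives in bank n.
    float a = buf[threadIdx.x];

    // 32-way conflict: thread n reads word n*32, so all 32 requests hit bank 0
    // with different addresses and the warp's access is replayed 32 times.
    float b = buf[threadIdx.x * 32];

    out[blockIdx.x * blockDim.x + threadIdx.x] = a + b;
}[/codebox]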

Your quote from the programming manual discusses a different topic: multi-word accesses by individual threads. L1 behaves just like shared memory in that case, with the same unavoidable bank conflicts.

Yes, shared memory doesn’t have cache lines, but L1 does. L1 access is line-based, not bank-based. This is clearly stated in both the programming guide and CUDA_fermi_overview_DK.pdf (slide 9).

Alexander, you were completely correct. My understanding was wrong, and the programming guide isn’t completely clear.

I wrote a test program that accessed L1 values in various ways to actually measure throughput and resolve the question. It tests all threads accessing the same cache line, as well as all accessing different cache lines.

I did the same for shared memory.

Here are the results. The absolute timings don’t matter, but the relative timings show clearly where the conflicts and serializations are.

“linear” means thread n accesses bank n. “scrambled” means a permuted order of all 32 banks. bank16, bank8, bank4, and bank2 are bank-conflict cases where the same bank is used by 16, 8, 4, or 2 threads.
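
(The attached memspeed.cu is the real test; purely for readers skimming the thread, a kernel in this spirit might look roughly like the sketch below. The index table, iteration count, and the zero-filled-data trick are assumptions for illustration, not the actual attached code.)

[codebox]// Hypothetical sketch of an L1 read-throughput test (NOT the attached memspeed.cu).
// idx[] holds a precomputed per-thread index that realizes the "linear", "scrambled",
// "bankN", and same/different cache line patterns.
#define ITER 10000

__global__ void l1ReadTest(const int *data, const int *idx, int *out)
{
    int myIdx = idx[threadIdx.x];    // access pattern under test
    int sum   = 0;
    for (int i = 0; i < ITER; ++i) {
        int v  = data[myIdx];        // should hit in L1 after the first pass
        sum   += v;
        myIdx += v;                  // data[] is zero-filled, so the index never changes
    }                                //   at run time, but the dependency keeps the
    out[threadIdx.x] = sum;          //   compiler from hoisting the load out of the loop
}[/codebox]

The shared-memory variant would be the same loop reading from a __shared__ buffer, with both kernels timed with CUDA events over enough iterations that launch overhead is negligible.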

L1 data queries

linear,  Same cache line:  53.2ms
scrambled,  Same cache line:  53.2ms
bank16,  Same cache line:  53.2ms
bank8,  Same cache line:  53.2ms
bank4,  Same cache line:  53.2ms
bank2,  Same cache line:  53.2ms
bank1,  Same cache line:  53.2ms
linear Different cache line:  440.7ms
scrambled Different cache line:  440.3ms
bank16 Different cache line:  438.1ms
bank8 Different cache line:  441.1ms
bank4 Different cache line:  440.7ms
bank2 Different cache line:  440.7ms
bank1 Different cache line:  440.3ms

Shared linear,  Same cache line:  38.0ms
Shared scrambled,  Same cache line:  38.0ms
Shared bank16,  Same cache line:  38.0ms
Shared bank8,  Same cache line:  38.0ms
Shared bank4,  Same cache line:  38.0ms
Shared bank2,  Same cache line:  38.0ms
Shared bank1,  Same cache line:  38.0ms
Shared linear Different cache line:  38.0ms
Shared scrambled Different cache line:  38.0ms
Shared bank16 Different cache line:  43.0ms
Shared bank8 Different cache line:  56.4ms
Shared bank4 Different cache line:  106.2ms
Shared bank2 Different cache line:  212.2ms
Shared bank1 Different cache line:  424.1ms

So, from the L1 timings, we do indeed see that any access where all of the data comes from the same cache line is fine and runs at a constant speed.

Multiple cache lines accessed by different threads are serialized. An interesting observation is that 32 threads accessing 32 different cache lines does not show the 32x slowdown that full serialization would imply; it only costs about 8x! Perhaps the controller is able to handle four cache lines at once, or something similar?

The shared memory results are as expected. If all threads access the same cache line, the 32-way broadcast feature of Fermi makes all accesses equal speed. (Compute 1.x only has 1-way broadcast.)

Otherwise, we have classic bank-conflict serialization. This all makes sense: the banking hardware is probably used by the cache to do word reordering anyway, so why not let shared memory access go through it the same way? So in some sense, unlike 1.x, there is a kind of “cache line” in shared memory, since the 32-way broadcast lets you ignore banking entirely if you’re only accessing a range of 128 bytes. (And it’s actually even more general than that.)
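
As a hypothetical illustration of that last point (again, not from the test program): if every index a warp generates falls inside one aligned 128-byte, 32-word window of shared memory, then distinct words necessarily map to distinct banks and identical words are served by the broadcast, so no pattern within the window can conflict on Fermi:

[codebox]__global__ void windowDemo(const int *pattern, int *out)
{
    __shared__ int win[32];                 // one 128-byte (32-word) window
    if (threadIdx.x < 32)
        win[threadIdx.x] = threadIdx.x;
    __syncthreads();

    // Any index in [0,32), repeated or not, is conflict-free on Fermi:
    // distinct words land in distinct banks, repeated words are broadcast.
    int i = pattern[threadIdx.x] & 31;
    out[blockIdx.x * blockDim.x + threadIdx.x] = win[i];
}[/codebox]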

Another interesting observation: L1 reads have lower throughput than shared memory reads… perhaps because of the extra lookup needed to find the cache line, or something similar.

An L1 cache hit read is about 40% slower than a shared memory read.

Code attached if anyone is interested.
memspeed.cu (6.13 KB)

I ran this code on a GTX 460:

linear,  Same cache line:  23.7ms
scrambled,  Same cache line:  23.7ms
bank16,  Same cache line:  23.7ms
bank8,  Same cache line:  23.7ms
bank4,  Same cache line:  23.7ms
bank2,  Same cache line:  23.7ms
bank1,  Same cache line:  23.7ms
linear Different cache line:  430.4ms
scrambled Different cache line:  430.4ms
bank16 Different cache line:  431.9ms
bank8 Different cache line:  430.6ms
bank4 Different cache line:  430.4ms
bank2 Different cache line:  430.4ms
bank1 Different cache line:  430.4ms

Shared linear,  Same cache line:  24.3ms
Shared scrambled,  Same cache line:  24.3ms
Shared bank16,  Same cache line:  24.2ms
Shared bank8,  Same cache line:  24.2ms
Shared bank4,  Same cache line:  24.2ms
Shared bank2,  Same cache line:  24.2ms
Shared bank1,  Same cache line:  24.2ms
Shared linear Different cache line:  24.3ms
Shared scrambled Different cache line:  24.2ms
Shared bank16 Different cache line:  30.0ms
Shared bank8 Different cache line:  53.8ms
Shared bank4 Different cache line:  107.7ms
Shared bank2 Different cache line:  215.2ms
Shared bank1 Different cache line:  430.4ms

SM 2.1 hardware is quite different:

  1. L1 and shared memory have almost equal speed
  2. The different-cache-line slowdown is about 18x

My posted run was for the GTX480.

But that sm 2.1 output is very interesting too!

Notice also that L1 and shared memory are not only the same speed as each other, but also appear to be twice as fast as on the GTX480.

This may be an artifact of the extra 2.1 SPs using ILP on the 460… but I would have expected the GTX480 to be limited by the shared accesses.

[codebox]linear, Same cache line: 45.0ms
scrambled, Same cache line: 45.0ms
bank16, Same cache line: 45.0ms
bank8, Same cache line: 45.0ms
bank4, Same cache line: 45.0ms
bank2, Same cache line: 45.0ms
bank1, Same cache line: 45.0ms
linear Different cache line: 492.1ms
scrambled Different cache line: 492.1ms
bank16 Different cache line: 492.1ms
bank8 Different cache line: 492.2ms
bank4 Different cache line: 492.1ms
bank2 Different cache line: 492.0ms
bank1 Different cache line: 492.0ms
Shared linear, Same cache line: 25.3ms
Shared scrambled, Same cache line: 25.3ms
Shared bank16, Same cache line: 25.3ms
Shared bank8, Same cache line: 25.3ms
Shared bank4, Same cache line: 25.3ms
Shared bank2, Same cache line: 25.3ms
Shared bank1, Same cache line: 25.3ms
Shared linear Different cache line: 25.3ms
Shared scrambled Different cache line: 25.3ms
Shared bank16 Different cache line: 31.2ms
Shared bank8 Different cache line: 57.1ms
Shared bank4 Different cache line: 114.1ms
Shared bank2 Different cache line: 228.1ms
Shared bank1 Different cache line: 456.1ms
[/codebox]

What kind of GTX 460 are you using? (Same question to AlexanderMalishev as well.) There’s a big difference between the 768 MB and the 1024 MB versions, for example. Overclocking may skew the results a bit as well.

For example, I have an ASUS DirectCU GTX 460 1GB with standard clock rates (675/900/1350), and my results are much closer to what AlexanderMalishev posted:

[codebox]linear, Same cache line: 26.2ms
scrambled, Same cache line: 26.1ms
bank16, Same cache line: 26.2ms
bank8, Same cache line: 26.1ms
bank4, Same cache line: 26.2ms
bank2, Same cache line: 26.2ms
bank1, Same cache line: 26.2ms
linear Different cache line: 456.1ms
scrambled Different cache line: 456.2ms
bank16 Different cache line: 456.2ms
bank8 Different cache line: 456.1ms
bank4 Different cache line: 456.2ms
bank2 Different cache line: 456.1ms
bank1 Different cache line: 456.2ms
Shared linear, Same cache line: 26.0ms
Shared scrambled, Same cache line: 26.0ms
Shared bank16, Same cache line: 26.0ms
Shared bank8, Same cache line: 26.0ms
Shared bank4, Same cache line: 26.0ms
Shared bank2, Same cache line: 26.0ms
Shared bank1, Same cache line: 26.0ms
Shared linear Different cache line: 26.0ms
Shared scrambled Different cache line: 26.0ms
Shared bank16 Different cache line: 32.0ms
Shared bank8 Different cache line: 57.1ms
Shared bank4 Different cache line: 114.2ms
Shared bank2 Different cache line: 228.2ms
Shared bank1 Different cache line: 456.2ms[/codebox]

Oh, and thanks for taking the time to make the tests, SPWorley! :)

Gigabyte GV-N460OC-1GI, 1 GB.

Thanks for all the discussion about my problem.

I think it’s clear now that threads of the same warp in my program are accessing different L1 cache lines, which makes L1 worse than the texture cache. The L1 cache is accessed in units of 128-byte cache lines. And what about the texture cache? How is it accessed?
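
For comparison, a minimal sketch of routing the same kind of read through the texture path, using the Fermi-era texture reference API (identifiers are illustrative and error checking is omitted):

[codebox]#include <cuda_runtime.h>

// Fermi-era texture reference bound to linear device memory.
texture<int, 1, cudaReadModeElementType> texData;

__global__ void texReadTest(const int *idx, int *out, int iter)
{
    int myIdx = idx[threadIdx.x];
    int sum   = 0;
    for (int i = 0; i < iter; ++i) {
        int v  = tex1Dfetch(texData, myIdx);  // read through the texture cache
        sum   += v;
        myIdx += v;                           // zero-filled data keeps the index fixed
    }
    out[threadIdx.x] = sum;
}

// Host side:
//   cudaBindTexture(NULL, texData, d_data, nBytes);
//   texReadTest<<<blocks, 32>>>(d_idx, d_out, 10000);
//   cudaUnbindTexture(texData);[/codebox]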