Use of Texture and Constant memory in Fermi Architecture

With the advent of the Fermi architecture, which has L1 and L2 caches, does the significance of texture and constant memory still hold?
Since loads and stores to global memory are cached, do we get any performance gain from using constant and texture memory?
Please correct my understanding if I am wrong…

Thanks,
Sai

Constant memory access with a known offset is essentially free. Fermi is a RISC processor: instruction arguments come from registers, an immediate constant, or a constant buffer. Anything else needs a separate load instruction.

immediate constant

set $p0 ne u32 $r4 -0x1

constant buffer

add b32 $r12 shl $r13 0x2 c2[0xc8]

global memory

ld b32 $r4 ca g[$r12(null)+0]

constant buffer with unknown offset

ld b32 $r17 c2[$r17(null)+0x20]

A texture load can be faster than a normal global memory load. Global memory load granularity is 128 bytes: if you load 4 bytes, the hardware fetches the 124 neighboring bytes too. Texture load granularity is 128 bytes at the L1 level, but only 32 bytes at the L2 and memory level.
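
Here is a minimal CUDA sketch of that texture path, written against the legacy texture-reference API (tex1Dfetch / cudaBindTexture) that Fermi-era code typically used; that API is deprecated and has been removed from recent toolkits, and the kernel names, sizes, and zero-filled placeholder data below are for illustration only.

// Scattered 4-byte gathers, once through plain global loads and once through
// the texture cache. On Fermi the texture path fetches in 32-byte chunks
// below L1 instead of full 128-byte lines, which helps scattered access.
#include <cstdio>
#include <cuda_runtime.h>

texture<float, cudaTextureType1D, cudaReadModeElementType> texIn;   // legacy texture reference

__global__ void gatherGlobal(const float* in, const int* idx, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[idx[i]];                 // each 4-byte load brings in a 128-byte line
}

__global__ void gatherTexture(const int* idx, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(texIn, idx[i]);  // serviced through the texture cache
}

int main()
{
    const int n = 1 << 20;
    float *d_in, *d_out;
    int   *d_idx;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMalloc(&d_idx, n * sizeof(int));
    cudaMemset(d_in,  0, n * sizeof(float)); // placeholder data; real code would upload actual values
    cudaMemset(d_idx, 0, n * sizeof(int));   // placeholder indices (all zero, so all reads stay in bounds)

    cudaBindTexture(NULL, texIn, d_in, n * sizeof(float));

    gatherGlobal <<<(n + 255) / 256, 256>>>(d_in, d_idx, d_out, n);
    gatherTexture<<<(n + 255) / 256, 256>>>(d_idx, d_out, n);

    cudaUnbindTexture(texIn);
    cudaDeviceSynchronize();
    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}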

Hi Alexander,

Thanks for the reply…

Your first reply says “Constant memory access with a known offset is essentially free”. Do you mean that the constant arguments in instructions are stored in constant memory?

Could you please elaborate on that point?

There are two cases:

  1. The constant argument is part of the instruction itself, so it is stored in the instruction cache. Example: mul.f32 r0 r1 3.1415

  2. The constant argument can be loaded from constant memory. Example: mul.f32 r0 r1 c0[10]

I mean that you don’t need to use a separate instruction to load data from constant memory. Almost every instruction can load data from constant memory.

For example, to add a value from constant memory you need just one instruction:

add.f32 r0 r0 c2[0x35]

To add a value from global memory you need two instructions:

ld.u32 r1 g[0x35]

add r0 r0 r1
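
To tie this back to CUDA C, here is a minimal sketch showing all three operand sources in one kernel; the names (c_bias, addSources) are made up, and whether the compiler actually folds the constant access into the add can be verified by inspecting the SASS with cuobjdump -sass.

// One kernel touching an immediate constant, a constant-bank value at a known
// offset, and global memory. On Fermi the first two can be encoded directly as
// operands of the add, while the global read needs its own load instruction.
#include <cstdio>
#include <cuda_runtime.h>

__constant__ float c_bias[16];              // placed in a constant bank (e.g. c2[...])

__global__ void addSources(const float* g_in, float* g_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v = g_in[i];                      // global memory: separate ld instruction
    v = v + 3.1415f;                        // immediate constant: encoded in the add itself
    v = v + c_bias[5];                      // known constant offset: c[][] operand of the add
    g_out[i] = v;
}

int main()
{
    const int n = 256;
    float h_bias[16] = {0};                 // placeholder values
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));
    cudaMemcpyToSymbol(c_bias, h_bias, sizeof(h_bias));   // host fills the constant bank

    addSources<<<1, n>>>(d_in, d_out, n);
    cudaDeviceSynchronize();
    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}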

Got it, thank you.

Hi Malishev, do you know the load granularity for constant cache memory?

Thanks,

Tuan

I don’t know.

A simple benchmark could reveal a lot of detail (see, for example, the “Demystifying GPU Microarchitecture through Microbenchmarking” paper), but currently I don’t have enough time to study it.
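
For what it’s worth, here is a rough sketch of how such a benchmark might look, loosely in the spirit of that paper’s pointer-chase approach (this is not the paper’s code, and the sizes, strides, and names are arbitrary): a single thread walks a dependent chain through __constant__ memory at increasing strides and reports cycles per access; the stride at which the per-access latency jumps would hint at the constant cache’s fetch granularity.

// Strided pointer chase through constant memory, timed with clock64()
// (available on Fermi, compute capability 2.0+). Results need careful
// interpretation; this only illustrates the measurement idea.
#include <cstdio>
#include <cuda_runtime.h>

#define CHAIN_LEN 8192                      // 32 KB of the 64 KB constant space

__constant__ int c_chain[CHAIN_LEN];

__global__ void chaseConstant(int iters, int* sink, long long* cycles)
{
    int idx = 0;
    long long start = clock64();
    for (int k = 0; k < iters; ++k)
        idx = c_chain[idx];                 // dependent loads, so latency is fully exposed
    long long stop = clock64();
    *sink = idx;                            // keep the chase from being optimized away
    *cycles = stop - start;
}

int main()
{
    static int h_chain[CHAIN_LEN];
    int *d_sink;
    long long *d_cycles, h_cycles;
    cudaMalloc(&d_sink, sizeof(int));
    cudaMalloc(&d_cycles, sizeof(long long));

    const int iters = 10000;
    for (int stride = 1; stride <= 64; stride *= 2) {       // 4 to 256 bytes
        for (int i = 0; i < CHAIN_LEN; ++i)
            h_chain[i] = (i + stride) % CHAIN_LEN;           // strided chase pattern
        cudaMemcpyToSymbol(c_chain, h_chain, sizeof(h_chain));

        chaseConstant<<<1, 1>>>(iters, d_sink, d_cycles);
        cudaMemcpy(&h_cycles, d_cycles, sizeof(long long), cudaMemcpyDeviceToHost);
        printf("stride %3d bytes: %.1f cycles/access\n",
               stride * (int)sizeof(int), (double)h_cycles / iters);
    }
    return 0;
}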
