Use of Texture and Constant memory in Fermi Architecture

With the advent of the Fermi architecture, which has L1 and L2 caches, does the significance of texture and constant memory still hold?
Since loads and stores to global memory are cached, do we get any performance gain from using constant and texture memory?
Please correct my understanding if I am wrong…

Thanks,
Sai

Constant memory access with a known offset is essentially free. Fermi is a RISC processor: instruction arguments come from registers, an immediate constant, or a constant buffer. Anything else needs a separate load instruction.

immediate constant

set $p0 ne u32 $r4 -0x1

constant buffer

add b32 $r12 shl $r13 0x2 c2[0xc8]

global memory

ld b32 $r4 ca g[$r12(null)+0]

constant buffer with unknown offset

ld b32 $r17 c2[$r17(null)+0x20]

A texture load can be faster than a normal global memory load. Global memory load granularity is 128 bytes: if you load 4 bytes, the hardware fetches the 124 neighboring bytes too. Texture load granularity is 128 bytes at the L1 level, but only 32 bytes at the L2 and memory level.
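
Here is a minimal CUDA sketch of that texture path, written against the legacy texture-reference API (tex1Dfetch / cudaBindTexture) that Fermi-era code typically used; that API is deprecated and has been removed from recent toolkits, and the kernel names, sizes, and zero-filled placeholder data below are for illustration only.

// Scattered 4-byte gathers, once through plain global loads and once through
// the texture cache. On Fermi the texture path fetches in 32-byte chunks
// below L1 instead of full 128-byte lines, which helps scattered access.
#include <cstdio>
#include <cuda_runtime.h>

texture<float, cudaTextureType1D, cudaReadModeElementType> texIn;   // legacy texture reference

__global__ void gatherGlobal(const float* in, const int* idx, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[idx[i]];                 // each 4-byte load brings in a 128-byte line
}

__global__ void gatherTexture(const int* idx, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(texIn, idx[i]);  // serviced through the texture cache
}

int main()
{
    const int n = 1 << 20;
    float *d_in, *d_out;
    int   *d_idx;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMalloc(&d_idx, n * sizeof(int));
    cudaMemset(d_in,  0, n * sizeof(float)); // placeholder data; real code would upload actual values
    cudaMemset(d_idx, 0, n * sizeof(int));   // placeholder indices (all zero, so all reads stay in bounds)

    cudaBindTexture(NULL, texIn, d_in, n * sizeof(float));

    gatherGlobal <<<(n + 255) / 256, 256>>>(d_in, d_idx, d_out, n);
    gatherTexture<<<(n + 255) / 256, 256>>>(d_idx, d_out, n);

    cudaUnbindTexture(texIn);
    cudaDeviceSynchronize();
    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}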

Hi Alexander,

Thanks for the reply…

Your first reply says “Constant memory access with a known offset is essentially free”. Do you mean that the constant arguments in instructions are stored in constant memory?

Could you please elaborate on that point?

There are two cases:

  1. The constant argument is part of the instruction itself, so it is stored in the instruction cache. Example: mul.f32 r0 r1 3.1415

  2. The constant argument can be loaded from constant memory. Example: mul.f32 r0 r1 c0[10]

I mean that you don’t need to use a separate instruction to load data from constant memory. Almost every instruction can load data from constant memory.

For example, to add a value from constant memory you need just one instruction:

add.f32 r0 r0 c2[0x35]

To add a value from global memory you need two instructions:

ld.u32 r1 g[0x35]

add r0 r0 r1
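
To tie this back to CUDA C, here is a minimal sketch showing all three operand sources in one kernel; the names (c_bias, addSources) are made up, and whether the compiler actually folds the constant access into the add can be verified by inspecting the SASS with cuobjdump -sass.

// One kernel touching an immediate constant, a constant-bank value at a known
// offset, and global memory. On Fermi the first two can be encoded directly as
// operands of the add, while the global read needs its own load instruction.
#include <cstdio>
#include <cuda_runtime.h>

__constant__ float c_bias[16];              // placed in a constant bank (e.g. c2[...])

__global__ void addSources(const float* g_in, float* g_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v = g_in[i];                      // global memory: separate ld instruction
    v = v + 3.1415f;                        // immediate constant: encoded in the add itself
    v = v + c_bias[5];                      // known constant offset: c[][] operand of the add
    g_out[i] = v;
}

int main()
{
    const int n = 256;
    float h_bias[16] = {0};                 // placeholder values
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));
    cudaMemcpyToSymbol(c_bias, h_bias, sizeof(h_bias));   // host fills the constant bank

    addSources<<<1, n>>>(d_in, d_out, n);
    cudaDeviceSynchronize();
    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}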

Got it, thank you.

Hi Malishev, do you know the load granularity for constant cache memory?

Thanks,

Tuan

I don’t know.

A simple benchmark could reveal a lot of detail (see, for example, the “Demystifying GPU Microarchitecture through Microbenchmarking” paper), but currently I don’t have enough time to study it.
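
For what it’s worth, here is a rough sketch of how such a benchmark might look, loosely in the spirit of that paper’s pointer-chase approach (this is not the paper’s code, and the sizes, strides, and names are arbitrary): a single thread walks a dependent chain through __constant__ memory at increasing strides and reports cycles per access; the stride at which the per-access latency jumps would hint at the constant cache’s fetch granularity.

// Strided pointer chase through constant memory, timed with clock64()
// (available on Fermi, compute capability 2.0+). Results need careful
// interpretation; this only illustrates the measurement idea.
#include <cstdio>
#include <cuda_runtime.h>

#define CHAIN_LEN 8192                      // 32 KB of the 64 KB constant space

__constant__ int c_chain[CHAIN_LEN];

__global__ void chaseConstant(int iters, int* sink, long long* cycles)
{
    int idx = 0;
    long long start = clock64();
    for (int k = 0; k < iters; ++k)
        idx = c_chain[idx];                 // dependent loads, so latency is fully exposed
    long long stop = clock64();
    *sink = idx;                            // keep the chase from being optimized away
    *cycles = stop - start;
}

int main()
{
    static int h_chain[CHAIN_LEN];
    int *d_sink;
    long long *d_cycles, h_cycles;
    cudaMalloc(&d_sink, sizeof(int));
    cudaMalloc(&d_cycles, sizeof(long long));

    const int iters = 10000;
    for (int stride = 1; stride <= 64; stride *= 2) {       // 4 to 256 bytes
        for (int i = 0; i < CHAIN_LEN; ++i)
            h_chain[i] = (i + stride) % CHAIN_LEN;           // strided chase pattern
        cudaMemcpyToSymbol(c_chain, h_chain, sizeof(h_chain));

        chaseConstant<<<1, 1>>>(iters, d_sink, d_cycles);
        cudaMemcpy(&h_cycles, d_cycles, sizeof(long long), cudaMemcpyDeviceToHost);
        printf("stride %3d bytes: %.1f cycles/access\n",
               stride * (int)sizeof(int), (double)h_cycles / iters);
    }
    return 0;
}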
