Why texture/constant memory under Fermi architecture

as i understand it, on compute capability 2.0 or greater you can split the 64 KB of on-chip memory into 16 KB L1 / 48 KB shared memory, or vice-versa.
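
roughly, you pick the split with the runtime API. here's a minimal sketch, assuming current CUDA runtime calls (myKernel is just a placeholder name):

    // minimal sketch: choosing the 16/48 KB split on a Fermi-class device.
    // myKernel is just a placeholder name for this example.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void myKernel() { /* ... */ }

    int main()
    {
        // device-wide default: prefer 48 KB shared / 16 KB L1
        cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);

        // or per-kernel: prefer 48 KB L1 / 16 KB shared for this kernel only
        cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);

        myKernel<<<1, 32>>>();
        cudaError_t err = cudaDeviceSynchronize();
        printf("%s\n", cudaGetErrorString(err));
        return 0;
    }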

but the point is that you have a regular access pattern, so with shared memory you can use that knowledge to make sure there is never a “cache miss”.

whereas if you use the cache, it’s going to be dependent on scheduler decisions, which can be pretty random, so it might evict data before you’re done with it, resulting in a big performance loss.

and that’s the general rule: caches are great for random access, but when you know the access pattern ahead of time and it’s regular, you can use that knowledge to your advantage (i.e. reducing “misses”) by statically scheduling memory fetches, e.g. via shared memory.
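
to make that concrete, here’s a minimal sketch of the shared-memory version of the idea: a 1d stencil where each block stages its tile plus halo once, then does all the repeated reads on-chip. the names (stencil1d, TILE, RADIUS) are made up for the example, and it assumes blockDim.x == TILE:

    #define TILE   256
    #define RADIUS 3

    __global__ void stencil1d(const float *in, float *out, int n)
    {
        // each block stages its tile plus halo into shared memory exactly once
        __shared__ float tile[TILE + 2 * RADIUS];

        int gid = blockIdx.x * blockDim.x + threadIdx.x;   // global index
        int lid = threadIdx.x + RADIUS;                    // index into the tile

        // coalesced load of the block's own elements
        tile[lid] = (gid < n) ? in[gid] : 0.0f;

        // the first RADIUS threads also fetch the left and right halo cells
        if (threadIdx.x < RADIUS) {
            int left  = gid - RADIUS;
            int right = gid + TILE;
            tile[lid - RADIUS] = (left  >= 0) ? in[left]  : 0.0f;
            tile[lid + TILE]   = (right <  n) ? in[right] : 0.0f;
        }
        __syncthreads();   // from here on, every read is on-chip

        if (gid < n) {
            float acc = 0.0f;
            for (int k = -RADIUS; k <= RADIUS; ++k)
                acc += tile[lid + k];
            out[gid] = acc / (2 * RADIUS + 1);
        }
    }

each input element gets pulled in from global memory once and reused 2*RADIUS+1 times from shared memory, so there’s nothing for a cache to get “wrong”.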

though if this is the only kernel running on the gpu at the time, it’s arguable how random the scheduling will really be. it doesn’t hurt to try it both ways and measure.

no idea. presumably it’s SRAM, so it should be nearly as quick as a register lookup, and the spec says shared/L1 bandwidth is 16 reads/writes per warp per cycle.

i believe an SM on GF104 and later actually services two warps at a time, which would explain why each warp only sees half the bandwidth (16) into the SRAM file. so all this logic seems to line up with the spec. i wouldn’t know why your measurement comes out off-spec. maybe you’re counting reads+writes together? maybe you’re short of full occupancy?
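
for what it’s worth, this is roughly the kind of micro-benchmark i’m assuming (kernel and variable names are made up): time a kernel that hammers shared memory with conflict-free loads, then divide bytes moved by elapsed time. launch enough blocks to fill every SM, otherwise low occupancy alone will make the number look off-spec.

    // rough sketch of a shared-memory read test. bytes moved ≈
    // blocks * threads * iters * 4; divide by the timed kernel duration.
    __global__ void smemReadBandwidth(float *sink, int iters)
    {
        __shared__ float buf[1024];   // 4 KB per block

        // fill the whole buffer so every later read is defined
        for (int i = threadIdx.x; i < 1024; i += blockDim.x)
            buf[i] = (float)i;
        __syncthreads();

        float acc = 0.0f;
        for (int i = 0; i < iters; ++i)
            acc += buf[(threadIdx.x + i) & 1023];   // stride-1, no bank conflicts

        // conditional write so the compiler can't throw the loop away
        if (acc == -1.0f)
            sink[threadIdx.x] = acc;
    }

note it only counts reads; if you count reads and writes together against a read-only (or write-only) spec figure you’ll be off by roughly a factor of two.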
