Why texture/constant memory under Fermi architecture

as i understand it, on compute capability 2.0 or greater you can split the 64 KB of on-chip memory into 16 KB L1 / 48 KB shared memory, or vice-versa.
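
roughly, you pick the split with the runtime API. here's a minimal sketch, assuming current CUDA runtime calls (myKernel is just a placeholder name):

    // minimal sketch: choosing the 16/48 KB split on a Fermi-class device.
    // myKernel is just a placeholder name for this example.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void myKernel() { /* ... */ }

    int main()
    {
        // device-wide default: prefer 48 KB shared / 16 KB L1
        cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);

        // or per-kernel: prefer 48 KB L1 / 16 KB shared for this kernel only
        cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);

        myKernel<<<1, 32>>>();
        cudaError_t err = cudaDeviceSynchronize();
        printf("%s\n", cudaGetErrorString(err));
        return 0;
    }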

but the point is that you have a regular access pattern, so with shared memory you can use that knowledge to make sure there is never a “cache miss”.

whereas if you use the cache, it’s going to be dependent on scheduler decisions, which can be pretty random, so it might evict data before you’re done with it, resulting in a big performance loss.

and that’s the general rule: caches are great for random access, but when you know the access pattern ahead of time and it’s regular, you can use that knowledge to your advantage (i.e. reducing “misses”) by statically scheduling memory fetches, e.g. via shared memory.
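
to make that concrete, here’s a minimal sketch of the shared-memory version of the idea: a 1d stencil where each block stages its tile plus halo once, then does all the repeated reads on-chip. the names (stencil1d, TILE, RADIUS) are made up for the example, and it assumes blockDim.x == TILE:

    #define TILE   256
    #define RADIUS 3

    __global__ void stencil1d(const float *in, float *out, int n)
    {
        // each block stages its tile plus halo into shared memory exactly once
        __shared__ float tile[TILE + 2 * RADIUS];

        int gid = blockIdx.x * blockDim.x + threadIdx.x;   // global index
        int lid = threadIdx.x + RADIUS;                    // index into the tile

        // coalesced load of the block's own elements
        tile[lid] = (gid < n) ? in[gid] : 0.0f;

        // the first RADIUS threads also fetch the left and right halo cells
        if (threadIdx.x < RADIUS) {
            int left  = gid - RADIUS;
            int right = gid + TILE;
            tile[lid - RADIUS] = (left  >= 0) ? in[left]  : 0.0f;
            tile[lid + TILE]   = (right <  n) ? in[right] : 0.0f;
        }
        __syncthreads();   // from here on, every read is on-chip

        if (gid < n) {
            float acc = 0.0f;
            for (int k = -RADIUS; k <= RADIUS; ++k)
                acc += tile[lid + k];
            out[gid] = acc / (2 * RADIUS + 1);
        }
    }

each input element gets pulled in from global memory once and reused 2*RADIUS+1 times from shared memory, so there’s nothing for a cache to get “wrong”.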

though if this is the only kernel running on the gpu at the time, it’s arguable how random the scheduling will really be. it doesn’t hurt to try it both ways and measure.

no idea. presumably it’s SRAM, so it should be nearly as quick as a register lookup, and the spec says shared/L1 bandwidth is 16 reads/writes per warp per cycle.

i believe an SM on GF104 and later actually services two warps at a time, which would explain why each warp only sees half the bandwidth (16) into the SRAM file. so all this logic seems to line up with the spec. i wouldn’t know why your measurement comes out off-spec. maybe you’re counting reads+writes together? maybe you’re short of full occupancy?
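
for what it’s worth, this is roughly the kind of micro-benchmark i’m assuming (kernel and variable names are made up): time a kernel that hammers shared memory with conflict-free loads, then divide bytes moved by elapsed time. launch enough blocks to fill every SM, otherwise low occupancy alone will make the number look off-spec.

    // rough sketch of a shared-memory read test. bytes moved ≈
    // blocks * threads * iters * 4; divide by the timed kernel duration.
    __global__ void smemReadBandwidth(float *sink, int iters)
    {
        __shared__ float buf[1024];   // 4 KB per block

        // fill the whole buffer so every later read is defined
        for (int i = threadIdx.x; i < 1024; i += blockDim.x)
            buf[i] = (float)i;
        __syncthreads();

        float acc = 0.0f;
        for (int i = 0; i < iters; ++i)
            acc += buf[(threadIdx.x + i) & 1023];   // stride-1, no bank conflicts

        // conditional write so the compiler can't throw the loop away
        if (acc == -1.0f)
            sink[threadIdx.x] = acc;
    }

note it only counts reads; if you count reads and writes together against a read-only (or write-only) spec figure you’ll be off by roughly a factor of two.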
