1d texture cache

DarkAr · October 7, 2008, 7:45pm

Hi

what is a layout of texture cache in case of tex1Dfetch ? what is a size of that cache
(for 280GTX)
I have kernel that is mainly memory bound.
Each threads block reads bunch of variables from global memory, does some calculations and out one variable.
Currently input data is organized as 1d array, and each thread reads 256 float variables at locally random places inside this array
(locally means here more or less: blockIdx +/- rand()%4096)

Texture cache is per thread block resource (like shared memory) or per multiprocessor ?
how it works, when i read table at index i, and then at i+10, the cache will always hit ?

thanks for any information in this topic.
I’ll try to organize data in my array in more cache friendly fassion.

E.D_Riedijk · October 8, 2008, 5:29am

Texture cache is per Texture Processing Unit. For G200 there are 3 Multiprocessors per TPC. For older architectures there are 2 Multiprocessors per TPC. As far as I remember, texture cache was said to be 8 kB per Multiproc., so older arch had 16 kB of cache in a TPC, I guess that has changed to 24 kB per TPC for G200, but have nowhere read any numbers.
As to the question how much offset you can have and still hit the cache, I think it is wise to make a small benchmarking kernel where you change the offsets and watch the resulting throughput. just try to make sure that the difference between blocks in your benchmark looks a little like the difference in your real kernel. As you can see 3 MP’s share the same cache, so if you have more than 1 block per MP, you have like 6 or 9 blocks sharing the same cache, so that can affect things too.

MisterAnderson42 · October 8, 2008, 12:23pm

As E.D. Riedijk says, benchmarking is the only way to know for certain.

But if you’d rather have an educated guess: I’d say that a random access width 4096 is small enough to get some small benefit from the texture cache, but certainly nowhere near the full device mem bandwidth. You need very tight 1D locality for all accesses made within a warp to get that.

Reimar · October 8, 2008, 12:43pm

I miss some information how many threads you use per block.

E.g. if you were using 512 threads per block, you would be reading 512*256 = 131072 or on average each element 16 times per block.

In that situation youo probably can win a lot by just copying a (fixed) part of the data into shared memory (coalesced, no textures!) and code a switch that takes the data from that shared memory block if it is there and from global memory otherwise (using tex1Dfetch if you want).

Since you do not tell us the details, you will have to decide yourself which method is likely to work best.

DarkAr · October 8, 2008, 2:31pm

the details are:

512 threads
each thread reads 256 values from array
something like this:

void Kernel(float *array)
{
float GatherRandom = 0;
int offset = 4096 + (blockIdx.x * 8196);

for (int i = 0; i < 256; i++)
GatherRandom += array[((somerandomfunction()%8192)-4096)+offset];

// do something with GatherRandom
}

i’v coded version that copies portions of ‘array’ to shared memory in coalesced manner
and then does lockup from all threads, but this gets very tricky since ‘somerandomfunction’ is not trivial and i need to remember it’s states, then restore, then go again External Image
but it is faster than texture cache now.

dear nvidia shared memory really should grow to at least 64kb in near future :D :D :D

Topic		Replies	Views
Texture memory performance CUDA Programming and Performance	4	4976	June 1, 2009
Texture cache characteristics 2D cache size CUDA Programming and Performance	5	6105	May 8, 2007
Texture cache architecture Line size of texture cache CUDA Programming and Performance	3	2932	August 27, 2008
CUDA texture cache CUDA Programming and Performance	7	12234	June 7, 2008
Understanding GPU caches can't get my head around it CUDA Programming and Performance	12	4848	March 14, 2009
CUDA texture memory performance CUDA Programming and Performance	4	33583	January 13, 2009
Confusion on using texture? CUDA Programming and Performance	14	4930	September 4, 2009
texture memory cache size CUDA Programming and Performance	3	3441	November 3, 2009
Texture access... Fetch size, Cache size & performance CUDA Programming and Performance	3	2923	December 29, 2009
simple memory questions CUDA Programming and Performance	3	1009	August 19, 2009

1d texture cache

Related topics