What is the layout of the texture cache in the case of tex1Dfetch? And what is the size of that cache?
I have a kernel that is mainly memory bound.
Each thread block reads a bunch of variables from global memory, does some calculations and outputs one variable.
Currently the input data is organized as a 1D array, and each thread reads 256 float variables at locally random places inside this array
(locally means here, more or less: blockIdx +/- rand()%4096).
Is the texture cache a per-thread-block resource (like shared memory) or per multiprocessor?
How does it work: when I read the table at index i, and then at i+10, will the cache always hit?
Thanks for any information on this topic.
I'll try to organize the data in my array in a more cache-friendly fashion.
Texture cache is per Texture Processing Cluster (TPC). For G200 there are 3 multiprocessors per TPC; for older architectures there are 2 multiprocessors per TPC. As far as I remember, the texture cache was said to be 8 kB per multiprocessor, so older architectures had 16 kB of cache in a TPC. I guess that has changed to 24 kB per TPC for G200, but I have not read any numbers anywhere.
As to the question of how much offset you can have and still hit the cache, I think it is wise to write a small benchmarking kernel where you change the offsets and watch the resulting throughput. Just try to make sure that the difference between blocks in your benchmark looks a little like the difference in your real kernel. As you can see, 3 MPs share the same cache, so if you have more than 1 block per MP, you have something like 6 or 9 blocks sharing the same cache, and that can affect things too.
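A benchmark along those lines could look roughly like this; everything here (the kernel name, the LCG used as a stand-in random function, the clamping, the timing via CUDA events) is my own sketch, not code from this thread:

```cuda
// Sketch of an offset benchmark: each thread makes 256 random reads within a
// window of +/- `window` elements around its block's base index, through
// tex1Dfetch. Vary the window across runs and compare the timings.
#include <cstdio>
#include <cuda_runtime.h>

texture<float, 1, cudaReadModeElementType> texRef;

__global__ void offsetBench(float *out, int window, int n)
{
    int base = blockIdx.x * blockDim.x;
    unsigned int state = base + threadIdx.x;   // cheap per-thread LCG seed
    float acc = 0.0f;
    for (int i = 0; i < 256; ++i) {
        state = state * 1664525u + 1013904223u;
        int idx = base + (int)(state % (unsigned)(2 * window)) - window;
        if (idx < 0) idx = 0;                  // clamp to the bound array
        if (idx >= n) idx = n - 1;
        acc += tex1Dfetch(texRef, idx);
    }
    out[base + threadIdx.x] = acc;
}

int main()
{
    const int n = 1 << 22;
    float *dIn, *dOut;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaBindTexture(0, texRef, dIn, n * sizeof(float));

    for (int window = 256; window <= 8192; window *= 2) {
        cudaEvent_t t0, t1;
        cudaEventCreate(&t0); cudaEventCreate(&t1);
        cudaEventRecord(t0);
        offsetBench<<<n / 256, 256>>>(dOut, window, n);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);
        float ms;
        cudaEventElapsedTime(&ms, t0, t1);
        printf("window +/- %5d: %.2f ms\n", window, ms);
        cudaEventDestroy(t0); cudaEventDestroy(t1);
    }
    cudaUnbindTexture(texRef);
    cudaFree(dIn); cudaFree(dOut);
    return 0;
}
```

The random-number math is deliberately cheap so that the measured time is dominated by the memory accesses rather than the index computation.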
As E.D. Riedijk says, benchmarking is the only way to know for certain.
But if you'd rather have an educated guess: I'd say that a random access window of +/-4096 is small enough to get some small benefit from the texture cache, but certainly nowhere near the full device memory bandwidth. You need very tight 1D locality across all accesses made within a warp to get that.
I'm missing some information about how many threads you use per block.
E.g. if you were using 512 threads per block, you would be reading 512*256 = 131072 values per block, or on average each element 16 times.
In that situation you probably can win a lot by just copying a (fixed) part of the data into shared memory (coalesced, no textures!) and coding a switch that takes the data from that shared memory block if it is there, and from global memory otherwise (using tex1Dfetch if you want).
Since you do not tell us the details, you will have to decide yourself which method is likely to work best.
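A rough shape for that switch could be the following; the tile size, kernel name, and the LCG standing in for the poster's random function are all my own assumptions, and tex1Dfetch here assumes the array is also bound to a texture reference:

```cuda
// Sketch: stage a fixed tile of `array` around the block's window into shared
// memory with coalesced loads, then serve each access from the tile when it
// hits and fall back to a texture fetch otherwise.
#define TILE 2048  // 8 kB of floats, leaving room in 16 kB of shared memory

texture<float, 1, cudaReadModeElementType> texRef;

__global__ void gatherWithTile(float *out, const float *array)
{
    __shared__ float tile[TILE];
    // centre the tile on this block's +/-4096 window
    int tileBase = 4096 + blockIdx.x * 8192 - TILE / 2;

    // coalesced copy: consecutive threads load consecutive elements
    for (int i = threadIdx.x; i < TILE; i += blockDim.x)
        tile[i] = array[tileBase + i];
    __syncthreads();

    int offset = 4096 + blockIdx.x * 8192;
    unsigned int state = offset + threadIdx.x;   // LCG stand-in for the real RNG
    float acc = 0.0f;
    for (int i = 0; i < 256; ++i) {
        state = state * 1664525u + 1013904223u;
        int idx = (int)(state % 8192u) - 4096 + offset;
        int local = idx - tileBase;
        acc += (local >= 0 && local < TILE)
                 ? tile[local]               // hit: shared memory
                 : tex1Dfetch(texRef, idx);  // miss: texture / global memory
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}
```

With a uniform +/-4096 distribution, a 2048-element tile centred on the window would catch roughly a quarter of the accesses; whether that pays off depends on how expensive the misses are.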
The details are:
each thread reads 256 values from the array,
something like this:
__global__ void Kernel(float *array)
{
    float GatherRandom = 0;
    int offset = 4096 + (blockIdx.x * 8192);  // stride matches the +/-4096 window
    for (int i = 0; i < 256; i++)
        GatherRandom += array[((somerandomfunction() % 8192) - 4096) + offset];
    // do something with GatherRandom
}
I've coded a version that copies portions of 'array' to shared memory in a coalesced manner
and then does the lookup from all threads, but this gets very tricky since 'somerandomfunction' is not trivial and I need to remember its state, then restore it, then go again.
But it is faster than the texture cache version now.
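For what it's worth, the save/restore dance gets a bit less painful if the random function carries explicit state that can be checkpointed. A hypothetical sketch (this LCG is not the poster's somerandomfunction, just a stand-in with the same explicit-state shape):

```cuda
// Sketch: an RNG with explicit state can be saved before a pass over one tile
// and restored afterwards, so the next tile replays exactly the same index
// sequence and picks up the accesses that fell outside the previous tile.
struct Rng { unsigned int s; };

__device__ unsigned int nextRand(Rng *r)
{
    r->s = r->s * 1664525u + 1013904223u;  // LCG step, stand-in for the real RNG
    return r->s;
}

__device__ float gatherTilePass(const float *tile, int tileBase, int tileSize,
                                int offset, Rng *r)
{
    Rng saved = *r;            // checkpoint the state before the pass
    float acc = 0.0f;
    for (int i = 0; i < 256; ++i) {
        int idx = (int)(nextRand(r) % 8192u) - 4096 + offset;
        int local = idx - tileBase;
        if (local >= 0 && local < tileSize)  // only accesses covered by this tile
            acc += tile[local];
    }
    *r = saved;                // restore, so the next tile replays the sequence
    return acc;
}
```

After the last tile, advancing the state once without restoring it moves the generator on to the next batch of 256 indices.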
Dear NVIDIA, shared memory really should grow to at least 64 kB in the near future :D :D :D