Lookup table: where to implement?

Hi,

I have to implement a lookup table. Which memory is best suited for this? I think shared memory will be too small, so the choice is between constant and texture memory. Please help me make a choice here.

Thanks

If every thread in the warp accesses the exact same address in the table, constant memory will work great.
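For that uniform-access case, the table is just a `__constant__` array. A minimal sketch (the table size, `kLut`, and `scale_by_lut` names are made up for illustration; constant memory is limited to 64 KB total):

```cuda
#include <cuda_runtime.h>

// Hypothetical 256-entry lookup table in constant memory.
__constant__ float kLut[256];

// The index depends only on blockIdx, so every thread in a warp reads the
// same entry and the constant cache broadcasts it in one transaction.
__global__ void scale_by_lut(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float s = kLut[blockIdx.x % 256];  // uniform within the warp
    if (i < n)
        data[i] *= s;
}

// Host side: fill the table with cudaMemcpyToSymbol, e.g.
//   float host_lut[256] = { /* ... */ };
//   cudaMemcpyToSymbol(kLut, host_lut, sizeof(host_lut));
```

If threads within a warp diverge to different entries, constant-memory reads serialize, which is exactly when the texture path below becomes attractive.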

If every thread is independent but tends to make clustered local reads, then use texture memory.
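A sketch of that texture path, using the texture-reference API that was current at the time (later CUDA versions replaced it with texture objects); the buffer and kernel names are illustrative:

```cuda
#include <cuda_runtime.h>

// Legacy texture reference bound to a linear buffer of table entries.
texture<float, 1, cudaReadModeElementType> lutTex;

// Each thread computes its own index; nearby threads tend to hit nearby
// entries, so most reads are served from the texture cache.
__global__ void lookup_kernel(const int *indices, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(lutTex, indices[i]);
}

// Host side (sketch):
//   float *d_lut;
//   cudaMalloc(&d_lut, entries * sizeof(float));
//   cudaMemcpy(d_lut, h_lut, entries * sizeof(float), cudaMemcpyHostToDevice);
//   cudaBindTexture(NULL, lutTex, d_lut, entries * sizeof(float));
```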

If every thread reads randomly and there's no locality, you could use texture, but it will give the same speed as plain device memory reads, and it will also pollute the texture cache if you are using it for other textures.

And if none of these solutions are acceptable, then you’ll have to wait for Fermi where the L2 cache will help you.

Will this be a global memory cache? If yes, why would people still prefer texture fetching?

Textures also give you hardware linear interpolation. It’s not full precision, though. AIUI, there are only 255 discrete locations between adjacent texel values, although the interpolation to these locations is done in full single precision.
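To use that hardware interpolation, the table has to live in a `cudaArray` with linear filtering enabled and be sampled with a fractional coordinate. A sketch with the legacy texture-reference API (names are made up):

```cuda
#include <cuda_runtime.h>

// Texture reference with hardware linear filtering enabled on the host side.
texture<float, 1, cudaReadModeElementType> interpTex;

__global__ void sample_kernel(float *out, int n, float step)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        // A fractional coordinate makes the hardware blend the two
        // neighbouring texels. The blend weight is a low-precision
        // fixed-point fraction, which is where the limited number of
        // interpolation positions mentioned above comes from.
        out[i] = tex1D(interpTex, i * step + 0.5f);
}

// Host side (sketch): linear filtering requires a cudaArray.
//   cudaArray *arr;
//   cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
//   cudaMallocArray(&arr, &desc, entries);
//   cudaMemcpyToArray(arr, 0, 0, h_lut, entries * sizeof(float),
//                     cudaMemcpyHostToDevice);
//   interpTex.filterMode = cudaFilterModeLinear;
//   cudaBindTextureToArray(interpTex, arr);
```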

Without having a Fermi card to play with right now, that is an impossible question to answer :blink:

The Fermi tuning guide suggests that L2/L1 caches have higher bandwidth than the texture caches and thus existing codes that utilize the texture cache for boosting the performance of semi-random reads will be faster if converted to simple global memory reads.

Hypothetically, I can think of one case where the texture cache might be preferred. If you have 2D access locality (i.e. you read across rows and down columns of a matrix semi-randomly), a 2D texture read may give better performance on Fermi than using straight global memory reads. But that is a lot of mights and maybes. All will be revealed via testing in a few weeks.
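That 2D-locality case would look something like the following sketch (again legacy texture references, illustrative names); the point is that the 2D texture cache keeps a small tile of the matrix resident, which a linear L1/L2 cache line cannot do across rows:

```cuda
#include <cuda_runtime.h>

// 2D texture over a matrix-shaped lookup table.
texture<float, 2, cudaReadModeElementType> lut2D;

// Threads wander semi-randomly in both the row and column direction.
__global__ void gather2d(const int *rows, const int *cols,
                         float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        // +0.5f targets the texel centre in unnormalized coordinates.
        out[i] = tex2D(lut2D, cols[i] + 0.5f, rows[i] + 0.5f);
}
```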

And most importantly, the L1/L2 caches are much bigger than the texture cache was on compute capability <= 1.3 devices. (I haven't looked to see how the texture cache is handled on Fermi; I assume it is just some indexing and interpolation logic sitting between you and the regular caches.) A lookup table up to 768 kB in size can be efficiently shared across all the multiprocessors without hitting device memory, whereas that would not have been possible before.
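On that model, a large table on Fermi needs no special memory space at all; after warm-up the reads are mostly served from L1/L2. A minimal sketch (names illustrative):

```cuda
#include <cuda_runtime.h>

// A large table in plain global memory; on Fermi, repeated reads hit the
// L1/L2 caches rather than device memory.
__global__ void big_lut_kernel(const float *__restrict__ lut,
                               const int *indices, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = lut[indices[i]];  // ordinary load, cached in L1/L2
}
```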