Batched 1D linear interpolation

I have two 1D arrays of floats. For each value in the first array, I want to use that value as the coordinate for sampling the second array with linear interpolation. Actually, I want to perform the same operation (but with different sampling coordinates) on many 1D arrays simultaneously, so both of my arrays are really 2D, representing a batch. I want to avoid using textures because the texture fill rate would become a bottleneck; I'd like to use shared memory as a cache instead. Unfortunately, the 1D arrays will not (in general) fit entirely into shared memory. The algorithm should be optimised for magnifications between 25% and 400%. I'd like to target compute capability 1.0 hardware, so reads need to be properly coalesced and no atomic operations are available. I think I basically need to write a direct-mapped cache algorithm, which is fairly straightforward except that I can't figure out how to serialise access in the case of a conflict (which will admittedly be pretty rare). Does anyone have advice on how to do it, or know where to find relevant examples?
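For reference, this is the per-thread work I have in mind, written as a serial C sketch rather than an actual kernel (the function names are mine; on the device, each thread would handle one `(row, col)` pair):

```c
#include <stddef.h>

/* Sample src (length n) at fractional coordinate x using linear
 * interpolation, clamping out-of-range coordinates to the ends. */
float lerp_sample(const float *src, size_t n, float x)
{
    if (x <= 0.0f) return src[0];
    if (x >= (float)(n - 1)) return src[n - 1];
    size_t i = (size_t)x;      /* integer part: left neighbour  */
    float  t = x - (float)i;   /* fractional part: blend weight */
    return src[i] + t * (src[i + 1] - src[i]);
}

/* Batched version: each row of coords samples the matching row of src.
 * coords and dst are rows x cols; src is rows x src_cols. */
void batched_lerp(const float *coords, const float *src, float *dst,
                  size_t rows, size_t cols, size_t src_cols)
{
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            dst[r * cols + c] =
                lerp_sample(src + r * src_cols, src_cols,
                            coords[r * cols + c]);
}
```

The reads of `src[i]` and `src[i + 1]` are the data-dependent accesses that make coalescing hard, since `i` comes from the coordinate array rather than the thread index.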

Turns out that I got my sums wrong and textures are actually faster in this instance, although I'm still interested in any suggestions on how to implement a direct-mapped cache.

One possible approach: use a 64-byte page size and maybe 192 cache slots. When a thread needs to read a value from global memory, it first determines which page that value is in, then checks the corresponding cache slot to see whether that page is already loaded. If so, it reads the value it needs from the cache. If not, it writes its chosen page number into a tag variable associated with the slot. Different threads might try to write different page numbers into the same slot, but only one (effectively selected at random) will succeed. In fact the selection should be better than random: if 90 threads try to load page A and 10 threads try to load page B, there should be roughly a 90% chance that page A wins. The threads then cooperate to load all the selected pages. Each thread then checks the appropriate cache slot again to see whether its page is now loaded. If so, it reads the value it needs from the cache; if not, it gives up and reads the value directly from global memory. This last part conveniently solves the problem of conflicts without having to resort to atomics. Does this sound viable?
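To make the scheme concrete, here is a serial C simulation of the three phases (post requests, cooperative fill, re-check with fallback). This is not device code: on the GPU, `tag[]` and `cache[]` would live in shared memory, the loops over "threads" would run in parallel, and `__syncthreads()` would separate the phases. All names (`cache_sim`, `post_requests`, etc.) are illustrative; the last-writer-wins race in phase 1 stands in for whichever thread happens to win the shared-memory write.

```c
#include <stddef.h>

#define PAGE_FLOATS 16   /* 64-byte pages = 16 floats           */
#define NUM_SLOTS   192  /* 192 slots * 64 B = 12 KB of "shared" */

/* tag[s] holds the page number cached in slot s, or -1 if empty. */
typedef struct {
    long  tag[NUM_SLOTS];
    float cache[NUM_SLOTS][PAGE_FLOATS];
} cache_sim;

/* Phase 1: every thread posts the page containing the global index it
 * wants (want[t]).  Racy on the device; serially, the last writer wins. */
void post_requests(cache_sim *cs, const size_t *want, size_t nthreads)
{
    for (size_t s = 0; s < NUM_SLOTS; ++s) cs->tag[s] = -1;
    for (size_t t = 0; t < nthreads; ++t) {
        long page = (long)(want[t] / PAGE_FLOATS);
        cs->tag[page % NUM_SLOTS] = page;  /* direct-mapped slot */
    }
}

/* Phase 2: cooperatively fill every claimed slot from global memory.
 * On the device this is the coalesced part: consecutive threads load
 * consecutive floats of each page. */
void fill_slots(cache_sim *cs, const float *src, size_t src_len)
{
    for (size_t s = 0; s < NUM_SLOTS; ++s) {
        if (cs->tag[s] < 0) continue;
        size_t base = (size_t)cs->tag[s] * PAGE_FLOATS;
        for (size_t i = 0; i < PAGE_FLOATS; ++i)
            cs->cache[s][i] = (base + i < src_len) ? src[base + i] : 0.0f;
    }
}

/* Phase 3: re-check the tag.  Hit -> read from the cache; miss (this
 * thread lost the phase-1 race) -> read directly from global memory. */
float read_elem(const cache_sim *cs, const float *src, size_t idx)
{
    long   page = (long)(idx / PAGE_FLOATS);
    size_t slot = (size_t)(page % NUM_SLOTS);
    if (cs->tag[slot] == page)
        return cs->cache[slot][idx % PAGE_FLOATS];
    return src[idx];  /* conflict loser: uncoalesced fallback */
}
```

Note that with linear interpolation each thread actually needs two neighbouring elements, which can straddle a page boundary, so in practice a thread would run the lookup twice (or request both pages in phase 1); the sketch reads a single element to keep the three phases clear.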

wow you SURE know your programming. :-). ok i can’t help you i’m sorry. i don’t have a clue. but i really wanna know how long you’ve been doing this kinda stuff and what it takes to do what you do.