In my use case I have an input set of 60 float matrices, each of size (500,700), for example.
I would like to handle 10 such matrices per kernel launch, ideally through 10 distinct textures held in an array (if possible), like this:
I would then bind the current batch of 10 (500,700) matrices to that set of textures before each kernel call, use them, unbind, and repeat the process until done.
Since I only need to interpolate across (x,y), I believe I should not use 3D textures, because their interpolation is across (x,y,z).
I already implemented a working application using __ldg() and my own interpolation, but since interpolation is a built-in feature of CUDA textures, I thought there might be a faster approach.
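A manual bilinear lookup through __ldg() could look roughly like the sketch below. This is an illustration only, not the actual application code: the names bilinearLdg, sampleOne and ldgDemo are made up, and it assumes row-major storage, clamping at the matrix edges, and a GPU of compute capability 3.5 or higher (required for __ldg()).

```cuda
#include <vector>
#include <cuda_runtime.h>

// Bilinear interpolation at (x, y) in a row-major matrix of width w and height h.
// Reads go through the read-only data cache via __ldg(); coordinates are clamped.
__device__ float bilinearLdg(const float* __restrict__ m, int w, int h, float x, float y)
{
    x = fminf(fmaxf(x, 0.0f), w - 1.0f);
    y = fminf(fmaxf(y, 0.0f), h - 1.0f);
    int x0 = (int)floorf(x), y0 = (int)floorf(y);
    int x1 = min(x0 + 1, w - 1), y1 = min(y0 + 1, h - 1);
    float fx = x - x0, fy = y - y0;

    float v00 = __ldg(&m[y0 * w + x0]);
    float v10 = __ldg(&m[y0 * w + x1]);
    float v01 = __ldg(&m[y1 * w + x0]);
    float v11 = __ldg(&m[y1 * w + x1]);

    float top = v00 + fx * (v10 - v00);   // interpolate along x on row y0
    float bot = v01 + fx * (v11 - v01);   // interpolate along x on row y1
    return top + fy * (bot - top);        // interpolate along y
}

__global__ void sampleOne(const float* m, int w, int h, float x, float y, float* out)
{
    *out = bilinearLdg(m, w, h, x, y);
}

// Hypothetical host helper: fills a 500 x 700 matrix with the ramp 2*col + 3*row
// and returns the interpolated value at (x, y). Error checks omitted for brevity.
float ldgDemo(float x, float y)
{
    const int w = 700, h = 500;
    std::vector<float> host(w * h);
    for (int r = 0; r < h; ++r)
        for (int c = 0; c < w; ++c)
            host[r * w + c] = 2.0f * c + 3.0f * r;

    float *dM = nullptr, *dOut = nullptr, result = -1.0f;
    cudaMalloc(&dM, host.size() * sizeof(float));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dM, host.data(), host.size() * sizeof(float), cudaMemcpyHostToDevice);
    sampleOne<<<1, 1>>>(dM, w, h, x, y, dOut);
    cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dM);
    cudaFree(dOut);
    return result;
}
```

The manual version computes its weights in full float precision, while the texture unit uses reduced-precision fixed-point weights, so the two approaches can differ slightly in the low-order bits.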
I did Google this topic before posting, but could not find a specific answer or (even more useful) a working example. Using textures in this manner has a more complicated set-up process than handling standard device memory.
How would I go about doing this and would it result in better performance than using __ldg()?
You will find out that (unless things have changed drastically since I last looked) you cannot make an array of textures like your code above shows.
What I have done before is use two textures and select the appropriate one at access time based on coordinates. Obviously, that approach does not scale to ten textures without hurting performance. If the individual textures have identical sizes, you may be able to combine the data for several of them into one larger texture, similar to the way the graphics guys do this (a texture atlas).
I am not sure why you would dismiss 3D textures right away. I have used 2D textures before where I just needed to interpolate in the x-dimension. Similarly you can use a 3D texture and just interpolate in one or two dimensions.
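To illustrate the point about 3D textures, here is a minimal sketch that stores the 10 matrices as slices of one 3D cudaArray and interpolates only in x and y. It assumes a texture object (cc 3.0 or higher), unnormalized coordinates and linear filtering; the names sampleSlices and volumeSliceDemo are hypothetical. The trick is that texel centres sit at integer + 0.5, so sampling at z = slice + 0.5 gives the neighbouring slices a weight of zero.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

constexpr int kW = 700, kH = 500, kD = 10;   // sizes taken from the question

// Thread `slice` samples its own slice; z = slice + 0.5 disables blending in z.
__global__ void sampleSlices(cudaTextureObject_t vol, float x, float y, float* out)
{
    int slice = threadIdx.x;
    if (slice < kD) out[slice] = tex3D<float>(vol, x + 0.5f, y + 0.5f, slice + 0.5f);
}

// Hypothetical self-check: slice k holds 100*k + column, sampled between columns 10 and 11.
bool volumeSliceDemo()
{
    bool ok = true;
    std::vector<float> host((size_t)kD * kW * kH);
    for (int k = 0; k < kD; ++k)
        for (int r = 0; r < kH; ++r)
            for (int c = 0; c < kW; ++c)
                host[((size_t)k * kH + r) * kW + c] = 100.0f * k + c;

    cudaChannelFormatDesc fmt = cudaCreateChannelDesc<float>();
    cudaArray_t arr = nullptr;
    ok &= (cudaMalloc3DArray(&arr, &fmt, make_cudaExtent(kW, kH, kD)) == cudaSuccess);

    cudaMemcpy3DParms cp = {};
    cp.srcPtr = make_cudaPitchedPtr(host.data(), kW * sizeof(float), kW, kH);
    cp.dstArray = arr;
    cp.extent = make_cudaExtent(kW, kH, kD);   // in elements, since an array is involved
    cp.kind = cudaMemcpyHostToDevice;
    ok &= (cudaMemcpy3D(&cp) == cudaSuccess);

    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;

    cudaTextureDesc td = {};
    td.addressMode[0] = td.addressMode[1] = td.addressMode[2] = cudaAddressModeClamp;
    td.filterMode = cudaFilterModeLinear;
    td.readMode = cudaReadModeElementType;
    td.normalizedCoords = 0;
    cudaTextureObject_t vol = 0;
    ok &= (cudaCreateTextureObject(&vol, &res, &td, nullptr) == cudaSuccess);

    float* dOut = nullptr;
    float out[kD] = {};
    ok &= (cudaMalloc(&dOut, sizeof(out)) == cudaSuccess);
    sampleSlices<<<1, kD>>>(vol, 10.5f, 3.0f, dOut);
    ok &= (cudaMemcpy(out, dOut, sizeof(out), cudaMemcpyDeviceToHost) == cudaSuccess);

    // Slices differ by 100, so any blending along z would show up here.
    for (int k = 0; k < kD; ++k)
        ok &= (std::fabs(out[k] - (100.0f * k + 10.5f)) < 1e-2f);

    cudaDestroyTextureObject(vol);
    cudaFreeArray(arr);
    cudaFree(dOut);
    return ok;
}
```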
If you’re working on hardware of compute capability 3.0 (cc 3.0) or higher, you can make an array of texture objects. As njuffa points out, an array of texture references is difficult or impossible.
It happens to demonstrate an array of 3D textures (i.e. objects), but it should be straightforward to convert it to an array of 2D textures (objects).
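As a rough idea of what the 2D version might look like, the sketch below creates 10 texture objects, one per 500 x 700 matrix, and passes them to the kernel in a by-value struct. It is a minimal illustration, not a tuned implementation: TexBatch, sampleBatch and textureObjectArrayDemo are hypothetical names, and it assumes unnormalized coordinates, clamp addressing and linear filtering.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

constexpr int kBatch = 10, kW = 700, kH = 500;   // sizes taken from the question

// Texture object handles are plain 64-bit values, so a batch fits in a by-value struct.
struct TexBatch { cudaTextureObject_t tex[kBatch]; };

// Thread i samples matrix i at (x, y) with hardware bilinear filtering.
// The +0.5f shifts to texel centres, so integer (x, y) returns the stored element.
__global__ void sampleBatch(TexBatch batch, float x, float y, float* out)
{
    int i = threadIdx.x;
    if (i < kBatch) out[i] = tex2D<float>(batch.tex[i], x + 0.5f, y + 0.5f);
}

// Hypothetical self-check: matrix k holds 100*k + column, sampled between columns 10 and 11.
bool textureObjectArrayDemo()
{
    bool ok = true;
    std::vector<float> host(kW * kH);
    cudaChannelFormatDesc fmt = cudaCreateChannelDesc<float>();
    cudaArray_t arrays[kBatch] = {};
    TexBatch batch = {};

    for (int k = 0; k < kBatch; ++k) {
        for (int r = 0; r < kH; ++r)
            for (int c = 0; c < kW; ++c)
                host[r * kW + c] = 100.0f * k + c;

        ok &= (cudaMallocArray(&arrays[k], &fmt, kW, kH) == cudaSuccess);
        ok &= (cudaMemcpy2DToArray(arrays[k], 0, 0, host.data(), kW * sizeof(float),
                                   kW * sizeof(float), kH, cudaMemcpyHostToDevice) == cudaSuccess);

        cudaResourceDesc res = {};
        res.resType = cudaResourceTypeArray;
        res.res.array.array = arrays[k];

        cudaTextureDesc td = {};
        td.addressMode[0] = cudaAddressModeClamp;
        td.addressMode[1] = cudaAddressModeClamp;
        td.filterMode = cudaFilterModeLinear;     // bilinear in x and y
        td.readMode = cudaReadModeElementType;    // return raw floats
        td.normalizedCoords = 0;                  // coordinates in texels
        ok &= (cudaCreateTextureObject(&batch.tex[k], &res, &td, nullptr) == cudaSuccess);
    }

    float* dOut = nullptr;
    float out[kBatch] = {};
    ok &= (cudaMalloc(&dOut, sizeof(out)) == cudaSuccess);
    sampleBatch<<<1, kBatch>>>(batch, 10.5f, 3.0f, dOut);
    ok &= (cudaMemcpy(out, dOut, sizeof(out), cudaMemcpyDeviceToHost) == cudaSuccess);

    for (int k = 0; k < kBatch; ++k) {
        ok &= (std::fabs(out[k] - (100.0f * k + 10.5f)) < 1e-2f);
        cudaDestroyTextureObject(batch.tex[k]);
        cudaFreeArray(arrays[k]);
    }
    cudaFree(dOut);
    return ok;
}
```

For the next batch of 10 matrices there should be no need to recreate anything: copying new data into the same cudaArrays with cudaMemcpy2DToArray between kernel launches leaves the texture objects valid, which replaces the bind/unbind cycle described in the question.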
If you simply want an example that shows 2D interpolation, I think there are one or more of those in the CUDA samples. The bindless texture and simple texture 3D examples may be of interest.
An alternative to hundreds of texture object handles (each of which requires 8 bytes) would be to use a layered cudaArray bound to a single texture object. The one drawback is that you have to know in advance what the maximum number of layers will be. Any dynamic resizing requires creating a new cudaArray and re-uploading all the layers.
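A minimal sketch of the layered approach follows, assuming 10 layers of 500 x 700 floats, unnormalized coordinates and linear filtering; sampleLayers and layeredDemo are hypothetical names. Filtering happens only within a layer, and the layer is selected by an integer index, so there is no blending between matrices.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

constexpr int kLayers = 10, kW = 700, kH = 500;   // sizes taken from the question

// Thread `layer` samples its own layer at (x, y) with bilinear filtering in x and y.
__global__ void sampleLayers(cudaTextureObject_t tex, float x, float y, float* out)
{
    int layer = threadIdx.x;
    if (layer < kLayers) out[layer] = tex2DLayered<float>(tex, x + 0.5f, y + 0.5f, layer);
}

// Hypothetical self-check: layer k holds 100*k + column, sampled between columns 10 and 11.
bool layeredDemo()
{
    bool ok = true;
    std::vector<float> host((size_t)kLayers * kW * kH);
    for (int k = 0; k < kLayers; ++k)
        for (int r = 0; r < kH; ++r)
            for (int c = 0; c < kW; ++c)
                host[((size_t)k * kH + r) * kW + c] = 100.0f * k + c;

    // The layer count is fixed at allocation time (the drawback mentioned above).
    cudaChannelFormatDesc fmt = cudaCreateChannelDesc<float>();
    cudaArray_t arr = nullptr;
    ok &= (cudaMalloc3DArray(&arr, &fmt, make_cudaExtent(kW, kH, kLayers),
                             cudaArrayLayered) == cudaSuccess);

    // Upload all layers in one copy; the depth of the extent is the layer count.
    cudaMemcpy3DParms cp = {};
    cp.srcPtr = make_cudaPitchedPtr(host.data(), kW * sizeof(float), kW, kH);
    cp.dstArray = arr;
    cp.extent = make_cudaExtent(kW, kH, kLayers);
    cp.kind = cudaMemcpyHostToDevice;
    ok &= (cudaMemcpy3D(&cp) == cudaSuccess);

    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;

    cudaTextureDesc td = {};
    td.addressMode[0] = cudaAddressModeClamp;
    td.addressMode[1] = cudaAddressModeClamp;
    td.filterMode = cudaFilterModeLinear;
    td.readMode = cudaReadModeElementType;
    td.normalizedCoords = 0;
    cudaTextureObject_t tex = 0;
    ok &= (cudaCreateTextureObject(&tex, &res, &td, nullptr) == cudaSuccess);

    float* dOut = nullptr;
    float out[kLayers] = {};
    ok &= (cudaMalloc(&dOut, sizeof(out)) == cudaSuccess);
    sampleLayers<<<1, kLayers>>>(tex, 10.5f, 3.0f, dOut);
    ok &= (cudaMemcpy(out, dOut, sizeof(out), cudaMemcpyDeviceToHost) == cudaSuccess);

    for (int k = 0; k < kLayers; ++k)
        ok &= (std::fabs(out[k] - (100.0f * k + 10.5f)) < 1e-2f);

    cudaDestroyTextureObject(tex);
    cudaFreeArray(arr);
    cudaFree(dOut);
    return ok;
}
```

With matrices of this size (500 x 700 x 4 bytes = 1.4 MB each), all 60 would occupy about 84 MB, so if memory allows it may be simpler to allocate 60 layers once and skip the per-batch uploads entirely.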