I need to do 1D linear interpolation (lerp) on a large number of 1D float vectors (possibly up to millions). All vectors have the same length, typically several hundred to a thousand elements. Each block will process many vectors, chosen effectively at random; sampling points within a vector are usually dense, but there is no locality between vectors.
I want to utilize the texture units to achieve this. However, one 1D-layered texture object only supports an array up to 16384x2048, so I would need hundreds of texture objects. Moreover, the vectors are updated by another kernel between lerp passes, and since layered textures are only supported on cudaArray, re-copying seems mandatory. I considered updating the cudaArray through surface objects, but surfaces do not support atomic reductions, which my algorithm needs. Currently I'm trying arrays of 2D texture objects backed by pitched linear memory, but I'm not sure it's the right choice.
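For reference, this is roughly how I'm creating one of those 2D texture objects over pitched linear memory, so the same buffer can be written by the update kernel and then sampled with `tex2D()`. The names (`d_vecs`, `vecLen`, `numVecs`) are just placeholders for my data:

```cuda
// Sketch: a 2D texture object bound to pitched linear memory.
// The height is limited by cudaDeviceProp::maxTexture2DLinear[1],
// so several such objects are needed to cover all vectors.
cudaTextureObject_t makeTex(float *d_vecs, size_t pitchBytes,
                            int vecLen, int numVecs)
{
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypePitch2D;
    resDesc.res.pitch2D.devPtr       = d_vecs;
    resDesc.res.pitch2D.desc         = cudaCreateChannelDesc<float>();
    resDesc.res.pitch2D.width        = vecLen;   // elements per vector
    resDesc.res.pitch2D.height       = numVecs;  // vectors in this chunk
    resDesc.res.pitch2D.pitchInBytes = pitchBytes;

    cudaTextureDesc texDesc = {};
    texDesc.addressMode[0]   = cudaAddressModeClamp;
    texDesc.addressMode[1]   = cudaAddressModeClamp;
    texDesc.filterMode       = cudaFilterModeLinear;  // hardware lerp
    texDesc.readMode         = cudaReadModeElementType;
    texDesc.normalizedCoords = 0;  // unnormalized (texel) coordinates

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
    return tex;
}
```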
While considering these approaches, I realized how little I know about the internals of texture memory, so I have written down some statements based on my current understanding. I would really appreciate it if someone more knowledgeable could confirm them or correct my misunderstandings. Thanks very much!
Here are my understandings and questions:
The elements of a cudaArray follow a Z-order (Morton) curve layout (as stated in another thread here). Thus, even though the interpolation never crosses layers, locality between layers is still needed for efficient use of memory bandwidth. This also applies to a 2D texture backed by a cudaArray, even when the lerp is always along one axis (i.e., one coordinate is i + 0.5f). I'm not sure whether it applies to a 2D texture over pitched memory.
Texture memory bandwidth is shared with normal global reads. Textures are likely to save bandwidth only through cache hits, or through better coalescing due to the spatial locality of the Z-order curve.
As stated in "https://devblogs.nvidia.com/cuda-pro-tip-kepler-texture-objects-improve-performance-and-flexibility/", a kernel can support up to 1 million texture objects. I've checked the values of cudaTextureObject_t; they are numbered sequentially as 1, 2, … There must be some context-dependent state initialized when creating them. Is it possible to call cudaCreateTextureObject or cudaDestroyTextureObject on a long contiguous sequence rather than one by one? The resource descriptions are identical; the only difference is the device pointer.
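As far as I can tell there is no batched creation/destruction API, so right now I do it in a plain loop, with only the devPtr field changing per object (descriptor setup as above; `base`, `chunkElems`, and `numTex` are placeholders):

```cuda
// Sketch: creating many texture objects that differ only in devPtr.
// resDesc/texDesc are filled in once beforehand.
std::vector<cudaTextureObject_t> texs(numTex);
for (int i = 0; i < numTex; ++i) {
    resDesc.res.pitch2D.devPtr = base + i * chunkElems;  // only field that differs
    cudaCreateTextureObject(&texs[i], &resDesc, &texDesc, NULL);
}

// ... kernels consume texs ...

for (int i = 0; i < numTex; ++i)
    cudaDestroyTextureObject(texs[i]);
```

If there is a cheaper way to set up hundreds of such objects per update, I'd be glad to hear it.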
Actually, if statement 1 (the Z-order layout point) is not true, a layered texture may behave identically to an array of texture objects, probably with better performance, since creating many texture objects consumes more context resources.
The precision of texture interpolation is 8 bits (the fractional weight is stored in 9-bit fixed point with 8 fractional bits), regardless of the coordinate range. Is it possible to get better precision?
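One workaround I'm considering, in case the 8-bit weight is not enough: fetch the two neighboring texels with point sampling (cudaFilterModePoint in the texture descriptor) and blend them in software, so the weight keeps full fp32 precision. A sketch of the device-side helper, assuming unnormalized coordinates:

```cuda
// Sketch: full-precision lerp in software. Hardware filtering quantizes
// the blend weight to 1/256 steps; this version does not.
__device__ float lerpFetch(cudaTextureObject_t tex, float x, int row)
{
    float xf = floorf(x - 0.5f);  // index of the left texel
    float w  = (x - 0.5f) - xf;   // exact fp32 fractional weight
    float a  = tex2D<float>(tex, xf + 0.5f, row + 0.5f);
    float b  = tex2D<float>(tex, xf + 1.5f, row + 0.5f);
    return a + w * (b - a);       // lerp with full float precision
}
```

This costs two fetches instead of one filtered fetch, but both fetches still go through the texture cache.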
The texture unit is a common and powerful tool in graphics, but in scientific computing we sometimes need more flexibility to make good use of it.