Is there any performance difference between using cudaBindTexture2D and cudaBindTextureToArray when accessing 2D textures? If not, what is the point of using 2D arrays?
Also, are there any row alignment (pitch) requirements when using cudaBindTexture2D? Will it work full speed with any pitch, work but run slower with unaligned rows, or fail to work at all with unaligned rows?
What is the bandwidth difference between cudaMemcpy and cudaMemcpy2D when doing host->device memory transfers? Is it better to align rows on the host and do a simple cudaMemcpy(), or let cudaMemcpy2D() do the GPU memory row alignment?
I guess I should just run some benchmarks myself, but if someone has already figured this out, some comments would be great.
We haven’t done detailed measurements either, but I can share some related experience here:
a) Once data is in texture memory, it does not really matter how it got there. Texture memory is cached, so alignment does not really matter.
b) I think a simple cudaMemcpy works faster than the 2D variant, so yes, if you can store the data aligned on the CPU, do it.
c) The point of using 2D arrays is that they have a stride, which is convenient for coalesced memory access patterns when the data resides in global memory (note that loading from global memory in a contiguous, coalesced way is almost as fast as loading from texture).
d) Off topic: by the way, newer devices are less sensitive to non-coalesced memory access.
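To make point b) concrete, here is a minimal host-side sketch of padding rows on the CPU so that the whole image can go up in one plain cudaMemcpy. The helper names align_up and pad_rows are hypothetical, not part of CUDA, and the pitch you pass in has to match the pitch of the device allocation (for example the value returned by cudaMallocPitch). Whether this is actually faster than cudaMemcpy2D is something to measure, not something this code shows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: round `bytes` up to the next multiple of `alignment`.
std::size_t align_up(std::size_t bytes, std::size_t alignment) {
    assert(alignment > 0);
    return (bytes + alignment - 1) / alignment * alignment;
}

// Hypothetical helper: copy `height` tightly packed rows of `row_bytes` each
// into a new buffer whose rows start every `pitch` bytes. Padding bytes at
// the end of each row are zero-filled.
std::vector<unsigned char> pad_rows(const unsigned char* src,
                                    std::size_t row_bytes,
                                    std::size_t height,
                                    std::size_t pitch) {
    assert(pitch >= row_bytes);
    std::vector<unsigned char> dst(pitch * height, 0);
    for (std::size_t y = 0; y < height; ++y) {
        std::memcpy(dst.data() + y * pitch, src + y * row_bytes, row_bytes);
    }
    return dst;
}
```

With the padded buffer in hand, the upload would be a single cudaMemcpy(d_ptr, padded.data(), pitch * height, cudaMemcpyHostToDevice), where d_ptr and pitch come from the device allocation. The cost is the extra copy on the host, so this mainly pays off if the data can be produced in padded form to begin with.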
Yes, there is a difference. cudaBindTextureToArray() uses cudaArrays, which are stored in a special memory layout that is optimized for texture fetches with 2D locality. The only problem is that you can’t directly write to CUDA arrays (you have to use cudaMemcpyToArray).
cudaBindTexture2D() is a recent addition that allows you to bind any piece of global memory as a 2D texture (we sometimes call this pitch linear texturing). This is convenient since you can write directly to this memory, but since the data is laid out linearly, the fetch performance can be lower, depending on the access pattern.
I’d recommend testing both to see which is faster.
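As a rough illustration of why the access pattern matters for pitch linear memory, the sketch below counts how many fixed-size memory lines a tile of elements touches when element (x, y) lives at byte offset y * pitch + x * elem_bytes. The function names, the 32-byte line size and the 4096-byte pitch used in the comments are assumptions chosen for illustration only; this is not a model of any real texture cache, and it says nothing about cudaArrays, whose layout is not public.

```cpp
#include <cstddef>
#include <set>

// Byte offset of element (x, y) in a pitch-linear (row-major) 2D buffer.
std::size_t pitch_linear_offset(std::size_t x, std::size_t y,
                                std::size_t pitch, std::size_t elem_bytes) {
    return y * pitch + x * elem_bytes;
}

// Number of distinct `line_bytes`-sized lines touched when reading a
// tile_w x tile_h tile whose top-left element is (0, 0).
// `line_bytes` is an illustrative assumption, not a hardware constant.
std::size_t lines_touched(std::size_t tile_w, std::size_t tile_h,
                          std::size_t pitch, std::size_t elem_bytes,
                          std::size_t line_bytes) {
    std::set<std::size_t> lines;
    for (std::size_t y = 0; y < tile_h; ++y) {
        for (std::size_t x = 0; x < tile_w; ++x) {
            std::size_t off = pitch_linear_offset(x, y, pitch, elem_bytes);
            std::size_t first = off / line_bytes;
            std::size_t last = (off + elem_bytes - 1) / line_bytes;
            for (std::size_t l = first; l <= last; ++l) {
                lines.insert(l);
            }
        }
    }
    return lines.size();
}
```

Under these assumptions (4-byte elements, 4096-byte pitch, 32-byte lines), 32 elements along one row touch 4 lines, a 4-wide by 8-tall tile touches 8, and 32 elements down one column touch 32. So a linear layout favors accesses along x and penalizes accesses that spread across rows, which is the case a 2D-optimized layout is meant to help.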
In short, will I be wasting cache memory when using 2D texture arrays if the threads in my warps (and blocks) are accessing texture elements in a linear (1D) pattern, i.e. along the x axis? (Each thread in a block will be accessing a different x coordinate along the same row.) Will a 2D texture access cache elements from the rows above and below? If yes, are these cached elements likely to be reused by other warps later on, or are they likely to be evicted by other linear accesses to the same row by other warps in the block?
I’m not sure my questions are making any sense to you… If not, is it still possible to give some general advice to keep in mind in order to use the available cache memory efficiently when accessing 2d textures?