cudaBindTexture2D vs cudaBindTextureToArray

Hi,

Is there any performance difference between using cudaBindTexture2D and cudaBindTextureToArray when accessing 2D textures? If not, what is the point of using 2D arrays?

Also, are there any row alignment (pitch) requirements when using cudaBindTexture2D? Will it work full speed with any pitch, work but run slower with unaligned rows, or fail to work at all with unaligned rows?

What is the bandwidth difference between cudaMemcpy and cudaMemcpy2D when doing host->device memory transfers? Is it better to align rows on the host and do a simple cudaMemcpy(), or let cudaMemcpy2D() handle the GPU memory row alignment?

I guess I should just run some benchmarks myself, but if someone has already figured this out, some comments would be great.

/Lars

We haven’t done the detailed measurements either, but I can share some related experience here:

a) Once the data is in texture memory, it does not really matter how it got there. Texture memory is cached, so alignment does not really matter.

b) I think a simple cudaMemcpy is faster than the 2D variant, so yes, if you can store the data aligned on the CPU side, do so.

c) The point of using 2D arrays is that they have a stride, which is convenient for coalesced memory access patterns when the data resides in global memory. (Note that loading from global memory in a contiguous, coalesced way is almost as fast as reading through a texture.)

d) Off topic: by the way, newer devices are less sensitive to non-coalesced memory access.
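To illustrate points b) and c), here is a rough sketch of the two host->device copy variants discussed above, using the CUDA runtime API of that era. The sizes and the hData pointer are illustrative placeholders, not from the thread.

```cuda
// Sketch only: copying a width x height float image host->device two ways.
int width = 1024, height = 768;
float *hData = (float*)malloc(width * height * sizeof(float));

// Variant 1: let the driver pick an aligned pitch, then cudaMemcpy2D
// copies row by row into the padded layout.
float *dPitched;
size_t pitch;  // pitch in bytes, chosen by cudaMallocPitch
cudaMallocPitch((void**)&dPitched, &pitch, width * sizeof(float), height);
cudaMemcpy2D(dPitched, pitch,
             hData, width * sizeof(float),   // host rows are packed
             width * sizeof(float), height,
             cudaMemcpyHostToDevice);

// Variant 2: rows already aligned/packed on the host -> one flat cudaMemcpy.
float *dLinear;
cudaMalloc((void**)&dLinear, width * height * sizeof(float));
cudaMemcpy(dLinear, hData, width * height * sizeof(float),
           cudaMemcpyHostToDevice);
```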

Yes, there is a difference. cudaBindTextureToArray() uses cudaArrays, which are stored in a special memory layout that is optimized for texture fetches with 2D locality. The only problem is that you can’t write directly to cudaArrays (you have to use cudaMemcpyToArray).
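A minimal sketch of the cudaArray path, using the texture reference API from the CUDA 2.x runtime (width, height, and hData are illustrative placeholders):

```cuda
// Texture reference declared at file scope, as required in this API.
texture<float, 2, cudaReadModeElementType> texRef;

cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
cudaArray *arr;
cudaMallocArray(&arr, &desc, width, height);

// Kernels cannot write to the array; data must be copied in instead.
cudaMemcpyToArray(arr, 0, 0, hData, width * height * sizeof(float),
                  cudaMemcpyHostToDevice);

cudaBindTextureToArray(texRef, arr, desc);

// In a kernel, fetch with: float v = tex2D(texRef, x + 0.5f, y + 0.5f);
```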

cudaBindTexture2D() is a recent addition that allows you to bind any piece of global memory as a 2D texture (we sometimes call this pitch-linear texturing). This is convenient since you can write directly to this memory, but since it is laid out linearly, fetch performance can be lower, depending on the access pattern.
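The pitch-linear alternative might look like this (same era texture reference API; names are illustrative). The key difference is that the underlying memory comes from cudaMallocPitch, so kernels can write to it directly while other kernels fetch through the texture cache:

```cuda
// Texture reference at file scope.
texture<float, 2, cudaReadModeElementType> texRef2;

float *dPtr;
size_t pitch;  // pitch in bytes, chosen by cudaMallocPitch
cudaMallocPitch((void**)&dPtr, &pitch, width * sizeof(float), height);

cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
cudaBindTexture2D(NULL, texRef2, dPtr, desc, width, height, pitch);

// A kernel may write directly, e.g.:
//   dPtr[y * (pitch / sizeof(float)) + x] = value;
// while another kernel reads via tex2D(texRef2, x + 0.5f, y + 0.5f).
```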

I’d recommend testing both to see which is faster.

Thanks, that’s useful information. I’ll try both versions to see if the memcpy to an array is worth it for me.

I’d like to better understand how 2D locality caching works though… I’m having pretty much the same question as raised in this thread:

http://forums.nvidia.com/index.php?showtop…amp;hl=locality

In short, will I be wasting cache memory when using 2D texture arrays if the threads in my warps (and blocks) access texture elements in a linear (1D) pattern, i.e. along the x axis? (Each thread in a block will access a different x coordinate along the same row.) Will a 2D texture access cache elements from the rows above and below? If yes, are these cached elements likely to be reused by other warps later on, or are they likely to be evicted by linear accesses to the same row by other warps in the block?

I’m not sure my questions are making any sense to you… If not, is it still possible to give some general advice to keep in mind in order to use the available cache memory efficiently when accessing 2d textures?

/Lars

Thanks, Simon Green,

Your answers are exactly what I had thought.

When CUDA 2.2 was released, I built a program to evaluate the performance of the cudaBindTexture2D() and cudaBindTextureToArray() functions.

After experimenting, I found that accessing data through a texture bound with cudaBindTextureToArray() is faster than the other approach.