Is there any performance difference between using cudaBindTexture2D and cudaBindTextureToArray when accessing 2D textures? If not, what is the point of using 2D arrays?
Also, are there any row alignment (pitch) requirements when using cudaBindTexture2D? Will it work at full speed with any pitch, work but run slower with unaligned rows, or fail to work at all with unaligned rows?
What is the bandwidth difference between cudaMemcpy and cudaMemcpy2D when doing host->device memory transfers? Is it better to align rows on the host and do a simple cudaMemcpy(), or to let cudaMemcpy2D() do the GPU memory row alignment?
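To make the "align rows on the host" option concrete, here is a rough sketch of what I have in mind. The helper names (aligned_pitch, pad_rows) are made up for illustration, and the alignment value is an assumption: in practice the pitch would presumably come from cudaMallocPitch() or the device's texturePitchAlignment property rather than a hard-coded number.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: round a row width in bytes up to the next
// multiple of `alignment` (which must be non-zero).
std::size_t aligned_pitch(std::size_t width_bytes, std::size_t alignment) {
    assert(alignment != 0);
    return (width_bytes + alignment - 1) / alignment * alignment;
}

// Hypothetical helper: copy a tightly packed image into a new host buffer
// whose rows start `pitch` bytes apart. Padding bytes are zero-filled.
std::vector<unsigned char> pad_rows(const unsigned char* src,
                                    std::size_t width_bytes,
                                    std::size_t height,
                                    std::size_t pitch) {
    assert(pitch >= width_bytes);
    std::vector<unsigned char> dst(pitch * height, 0);
    for (std::size_t y = 0; y < height; ++y) {
        // Row y of the packed source lands at offset y * pitch in the padded copy.
        std::memcpy(dst.data() + y * pitch, src + y * width_bytes, width_bytes);
    }
    return dst;
}
```

The padded buffer would then go to the device in a single cudaMemcpy() of pitch * height bytes, versus handing the packed buffer to cudaMemcpy2D() with the source pitch set to the row width and letting it scatter the rows.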
I guess I should just run some benchmarks myself, but if someone has already figured this out, some comments would be great.