Question about texture memory

Hi there,

I’m still new to texture memory, and a few questions came up the first time I used it:


It seems that cudaMemcpyToArray() is much, much slower than a plain cudaMemcpy() to global memory. If this is correct, can someone explain this behaviour?


I have two kernel functions which interact in the following way:

inputData -> kernel 1 -> intermediate result (resides on the device) -> kernel 2 -> final result (copied back)

Until now, inputData was stored in global memory. Now it’s stored in texture memory, and kernel 1 got a little faster. But surprisingly, kernel 2 is now slower than before. The intermediate result is a separate array in global memory, so kernel 2 never touches the inputData array. Does someone have an idea why this change has an impact on kernel 2?


What is the difference between a CUDA array (1D) and linear memory bound to a texture reference?

I did some testing, and fetching from a CUDA array seems to be faster than fetching from linear memory bound to a texture. But the performance gain is wiped out by cudaMemcpyToArray() (which I need to call many times).

Could someone explain that?

Thanks in advance…

The slow cudaMemcpyToArray() was real a while ago (CUDA 1.1, IIRC). It was reported as a bug and supposedly fixed. I haven’t checked recently, though. Memory in a cudaArray is stored in a special format, so presumably some translation goes on when you do the memcpy.

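Are you calling cudaThreadSynchronize() before every wall-clock timing measurement? Kernel launches are asynchronous, so without it the time of one kernel can end up being charged to the next one.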
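A minimal sketch of what synchronized wall-clock timing could look like. The kernel `scaleKernel` and the helper `timeKernelMs` are hypothetical names made up for illustration, not code from this thread. The sketch uses cudaDeviceSynchronize(), the name that replaced the now-deprecated cudaThreadSynchronize() in later toolkits.

```cuda
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical stand-in for "kernel 1" or "kernel 2": scales an array in place.
__global__ void scaleKernel(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Wall-clock milliseconds for one launch of scaleKernel on n device floats.
// Returns a negative value if the launch or its execution reported an error.
double timeKernelMs(float* devData, int n)
{
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Drain previously queued GPU work so it is not charged to this kernel.
    cudaDeviceSynchronize();
    auto start = std::chrono::steady_clock::now();

    scaleKernel<<<blocks, threads>>>(devData, n, 2.0f);
    cudaError_t launchErr = cudaGetLastError();

    // The launch returns immediately; wait for the kernel to finish
    // before reading the clock again.
    cudaError_t syncErr = cudaDeviceSynchronize();
    auto stop = std::chrono::steady_clock::now();

    if (launchErr != cudaSuccess || syncErr != cudaSuccess) return -1.0;
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Without the first synchronize, time still being spent in an earlier kernel or copy would show up in this measurement, which is one way a change to kernel 1 can appear to slow down kernel 2.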

Functionally, if you want to use the coordinate wrapping or linear interpolation features, you must use a CUDA array. That is the only advantage that CUDA arrays offer. So if you aren’t using those features, stick with global memory bound to a texture.
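To make the difference concrete, here is a rough sketch of both variants. Note the assumptions: it uses the texture object API from newer CUDA toolkits rather than the texture references discussed in this thread, and the names `makeLinearTexture`, `makeArrayTexture` and `readBoth` are hypothetical. The linear-memory texture is read by integer index with tex1Dfetch(), while the CUDA array texture is sampled with tex1D() and can have wrapping and linear filtering switched on.

```cuda
#include <cuda_runtime.h>

// Texture over plain linear device memory: integer-indexed fetches only,
// no wrapping and no filtering. Returns 0 on failure.
cudaTextureObject_t makeLinearTexture(float* devPtr, size_t count)
{
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = devPtr;
    res.res.linear.desc = cudaCreateChannelDesc<float>();
    res.res.linear.sizeInBytes = count * sizeof(float);

    cudaTextureDesc tex = {};
    tex.readMode = cudaReadModeElementType;

    cudaTextureObject_t obj = 0;
    if (cudaCreateTextureObject(&obj, &res, &tex, nullptr) != cudaSuccess) return 0;
    return obj;
}

// Texture over a 1D CUDA array with wrapping and linear interpolation
// enabled. Returns 0 on failure.
cudaTextureObject_t makeArrayTexture(cudaArray_t array)
{
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = array;

    cudaTextureDesc tex = {};
    tex.addressMode[0] = cudaAddressModeWrap;  // coordinate wrapping
    tex.filterMode = cudaFilterModeLinear;     // linear interpolation
    tex.readMode = cudaReadModeElementType;
    tex.normalizedCoords = 1;                  // wrap mode needs normalized coordinates

    cudaTextureObject_t obj = 0;
    if (cudaCreateTextureObject(&obj, &res, &tex, nullptr) != cudaSuccess) return 0;
    return obj;
}

// Reads element i through both textures; out must hold 2 * n floats.
__global__ void readBoth(cudaTextureObject_t linTex, cudaTextureObject_t arrTex,
                         float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = tex1Dfetch<float>(linTex, i);              // exact element i
        out[n + i] = tex1D<float>(arrTex, (i + 0.5f) / n);  // sampled at the center of texel i
    }
}
```

If all you ever do is the tex1Dfetch() style of access, the array variant buys you nothing functionally, which is the point made above.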

Thanks for your reply, it helps a lot!

Today I had some time for testing. It seems that the performance difference between cudaMemcpyToArray() and cudaMemcpy() is not as big as I thought. But the transfer size seems to matter:

Transferring more than 10,000 bytes: cudaMemcpyToArray() is faster than cudaMemcpy().

Transferring less than 10,000 bytes: cudaMemcpyToArray() is slower than cudaMemcpy().

But as I mentioned, the difference is not that big: I measured 1-3 % at most. The fetch time (CUDA array vs. linear memory) was nearly the same…
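For anyone who wants to repeat this kind of comparison, a rough sketch of a size sweep follows. The helper names `averageMs` and `compareCopies` and the three sizes are assumptions chosen for illustration, not the setup used for the numbers above. cudaMemcpyToArray() is marked deprecated in recent toolkits, so expect a compiler warning.

```cuda
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: average wall-clock milliseconds per call of `copy`
// over `reps` repetitions, synchronizing before and after the timed region.
template <typename Copy>
double averageMs(Copy copy, int reps)
{
    cudaDeviceSynchronize();
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) copy();
    cudaDeviceSynchronize();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count() / reps;
}

// Prints one line per transfer size; returns false if any CUDA call failed.
bool compareCopies(int reps)
{
    const size_t sizes[] = {1000, 10000, 100000};  // bytes; illustrative only
    for (size_t bytes : sizes) {
        const size_t count = bytes / sizeof(float);
        std::vector<float> host(count, 1.0f);

        float* devLinear = nullptr;
        cudaArray_t devArray = nullptr;
        cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
        if (cudaMalloc(&devLinear, bytes) != cudaSuccess) return false;
        if (cudaMallocArray(&devArray, &desc, count) != cudaSuccess) {
            cudaFree(devLinear);
            return false;
        }

        // Host -> linear global memory.
        double linearMs = averageMs([&] {
            cudaMemcpy(devLinear, host.data(), bytes, cudaMemcpyHostToDevice);
        }, reps);

        // Host -> 1D CUDA array.
        double arrayMs = averageMs([&] {
            cudaMemcpyToArray(devArray, 0, 0, host.data(), bytes,
                              cudaMemcpyHostToDevice);
        }, reps);

        printf("%zu bytes: cudaMemcpy %.4f ms, cudaMemcpyToArray %.4f ms\n",
               bytes, linearMs, arrayMs);

        cudaFree(devLinear);
        cudaFreeArray(devArray);
    }
    return cudaGetLastError() == cudaSuccess;
}
```

Averaging over many repetitions matters here: a difference of 1-3 % is easily within the noise of a single measurement.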