I’m still unfamiliar with texture memory, and during my first use of it some questions came to mind:
1.:
It seems that cudaMemcpyToArray() is much slower than a plain cudaMemcpy() into linear memory (no texture involved). If this is correct, can someone explain this behaviour?
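For reference, the two copy paths being compared look roughly like this (buffer names are illustrative; `cudaMemcpyToArray()` is deprecated in recent CUDA releases):

```cuda
// Path 1: copy into plain linear device memory.
float *d_linear;
cudaMalloc((void**)&d_linear, N * sizeof(float));
cudaMemcpy(d_linear, h_data, N * sizeof(float), cudaMemcpyHostToDevice);

// Path 2: copy into a CUDA array. The driver may rearrange the data into
// the array's opaque internal layout, which can make this path slower.
cudaArray *d_array;
cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
cudaMallocArray(&d_array, &desc, N, 1);
cudaMemcpyToArray(d_array, 0, 0, h_data, N * sizeof(float),
                  cudaMemcpyHostToDevice);
```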
2.:
I have 2 kernel functions which interact in the following way:
inputData -> kernel 1 -> intermediate result (resides on the device) -> kernel 2 -> final result (copied back)
Until now, inputData was stored in global memory. Now it’s stored in texture memory, and kernel 1 sped up a little. But surprisingly, kernel 2 is now slower than before. The intermediate result is a separate array in global memory, so I am not reusing the inputData array. Does anyone have an idea why this change has an impact on kernel 2?
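A minimal sketch of such a pipeline, using the texture reference API of that CUDA era (now deprecated; kernel bodies and names are made up for illustration):

```cuda
// Texture reference for the read-only input of kernel 1.
texture<float, 1, cudaReadModeElementType> inputTex;

__global__ void kernel1(float *intermediate, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        intermediate[i] = tex1Dfetch(inputTex, i) * 2.0f;  // placeholder work
}

__global__ void kernel2(const float *intermediate, float *result, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        result[i] = intermediate[i] + 1.0f;                // placeholder work
}

// Host side (error checking omitted):
//   cudaBindTexture(0, inputTex, d_input, n * sizeof(float));
//   kernel1<<<grid, block>>>(d_intermediate, n);
//   kernel2<<<grid, block>>>(d_intermediate, d_result, n);
```

Note that only kernel 1 reads through the texture; kernel 2 reads the intermediate array straight from global memory.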
3.:
What is the difference between a CUDA array (1D) and linear memory bound to a texture reference?
I tested a little, and using a CUDA array seems to be faster than using linear memory bound to a texture. But the performance gain is wiped out by cudaMemcpyToArray(), which I need to call many times.
This was true a while ago (CUDA 1.1, IIRC). It was reported as a bug and supposedly fixed; I haven’t checked recently, though. Memory in a cudaArray is stored in a special, opaque format, so presumably some translation goes on when you do the memcpy.
Are you calling cudaThreadSynchronize() before every wall-clock timing measurement?
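Kernel launches are asynchronous, so without a synchronize the wall clock only measures launch overhead, not execution time. A minimal pattern (kernel name and arguments are illustrative):

```cuda
// Launch the kernel, then force the host to wait for it to finish
// before reading the timer.
kernel1<<<grid, block>>>(d_intermediate, n);
cudaThreadSynchronize();  // deprecated; use cudaDeviceSynchronize() in newer CUDA
// ... stop the wall-clock timer here ...
```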
Functionally, if you want to use the coordinate wrapping or linear interpolation features, you must use a CUDA array. That is the only advantage CUDA arrays offer, so if you aren’t using those features, stick with global memory bound to a texture.
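A sketch of what those CUDA-array-only features look like with the (deprecated) texture reference API; the texture and kernel names are made up. Note that wrap addressing requires normalized coordinates:

```cuda
// Hardware filtering and wrapping are only available through a CUDA array
// fetched with tex1D(), not via tex1Dfetch() on linear memory.
texture<float, 1, cudaReadModeElementType> arrayTex;

// Host side, before launching:
//   arrayTex.addressMode[0] = cudaAddressModeWrap;   // wrap coordinates
//   arrayTex.filterMode     = cudaFilterModeLinear;  // hardware interpolation
//   arrayTex.normalized     = 1;                     // coordinates in [0,1)
//   cudaBindTextureToArray(arrayTex, d_array);

__global__ void sampleKernel(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1D(arrayTex, (i + 0.5f) / n);  // interpolated, wrapped fetch
}
```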
Today I had some time for testing. It seems that the performance difference between cudaMemcpyToArray() and cudaMemcpy() is not as big as I thought, but the transfer size matters:
Transferring more than 10,000 bytes: cudaMemcpyToArray() is faster than cudaMemcpy().
Transferring less than 10,000 bytes: cudaMemcpyToArray() is slower than cudaMemcpy().
But as I mentioned, the difference is not large: I measured at most 1–3%. The fetch times (CUDA array vs. linear memory) were nearly the same…