As far as I know, texture memory is cached while global memory is not.
What does make more sense?
load the data into texture memory, perform the operation, and write the output to global memory (since texture memory is read-only). Then copy from global memory back into texture memory for the next step, and so on …
or:
load the data into texture memory once and from then on operate only on global memory.
or:
don’t use texture memory at all if I don’t need features like interpolation or reading uchars as normalized floats?
I can’t estimate how much of a difference this will make in speed.
But if I understand this correctly: if I copy data from global memory to texture memory, I have to access every pixel twice, because I then need to read the data back out of texture memory again.
So it would be better to avoid device-to-device copies?
You can bind global memory directly to 1D textures, but which approach is better depends on your access pattern.
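For what it’s worth, here’s a minimal sketch of binding a 1D texture directly to linear global memory (legacy texture-reference API, pre-CUDA 12; the kernel and variable names are just illustrative):

```cuda
#include <cuda_runtime.h>

// 1D texture reference bound to linear memory (no cudaArray involved)
texture<float, 1, cudaReadModeElementType> texIn;

__global__ void scale(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * tex1Dfetch(texIn, i);  // cached read via the texture unit
}

int main()
{
    const int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    // Bind the texture reference straight to the cudaMalloc'ed buffer.
    cudaBindTexture(0, texIn, d_in, n * sizeof(float));

    scale<<<(n + 255) / 256, 256>>>(d_out, n);

    cudaUnbindTexture(texIn);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Note that with `tex1Dfetch` you address by integer index, so you keep texture caching but give up cudaArray-only features like filtering and normalized coordinates.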
My test program shows time cost of:
coalesced global memory read < texture fetch < not-coalesced global memory read
so if you read memory randomly, may access the same address from multiple threads, or have to read unaligned memory, you may want to use a texture. Otherwise, global memory is probably the better choice.
Yes, it’s going to depend on your algorithm; it’s hard to tell. In addition to what asadafag said, I improved my algorithm by a factor of 2 by eliminating device-to-device copies.

In the beginning I was binding textures to cudaArrays (because texture fetching is optimized for cudaArrays). Since it’s not possible to address a cudaArray directly within a kernel, I was using a buffer in global memory (allocated with cudaMalloc), working on it from my kernel, and at the end copying the buffer back into my cudaArray with cudaMemcpyToArray.

Now I bind the texture directly to my buffer, so the device-to-device copy is no longer needed. In my case, it’s much faster.
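A sketch of that change, in case it helps (legacy texture-reference API; the function and variable names here are only illustrative, not the poster’s actual code):

```cuda
#include <cuda_runtime.h>

// Texture reference the kernels fetch from
texture<float, 1, cudaReadModeElementType> texBuf;

void bind_input(float *d_buf, size_t n)
{
    // Old path: texBuf was bound to a cudaArray, so after every kernel the
    // output buffer had to be copied back where the texture could see it:
    //   cudaMemcpyToArray(arr, 0, 0, d_buf, n * sizeof(float),
    //                     cudaMemcpyDeviceToDevice);   // extra D2D copy per step
    //
    // New path: bind the texture reference straight to the linear buffer the
    // kernel writes into -- the device-to-device copy disappears entirely.
    cudaBindTexture(0, texBuf, d_buf, n * sizeof(float));
}
```

The trade-off is that a texture bound to linear memory loses the 2D spatial-locality caching and filtering that cudaArrays provide, so whether this wins depends on the access pattern, as noted above.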