I am trying to compare the performance of different memory types for digital image processing.
The main idea behind digital filters is a “moving window”: each processed pixel depends on its neighbours (x±1, y±1).
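To make the “moving window” concrete, here is a minimal sketch of that access pattern in global memory. It assumes CUDA, a single-channel float image, a 3x3 mean filter and clamped borders; the names `boxFilter3x3` and `runBoxFilter` are made up for illustration, and this is not the code that was actually timed.

```cuda
#include <vector>
#include <cuda_runtime.h>

// 3x3 mean filter reading directly from global memory.
// Neighbours outside the image are clamped to the nearest border pixel.
__global__ void boxFilter3x3(const float* in, float* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            int nx = min(max(x + dx, 0), width - 1);
            int ny = min(max(y + dy, 0), height - 1);
            sum += in[ny * width + nx];
        }
    }
    out[y * width + x] = sum / 9.0f;
}

// Hypothetical host helper: copies the image to the device, runs the
// kernel and copies the result back. Error checking is omitted for brevity.
std::vector<float> runBoxFilter(const std::vector<float>& img, int width, int height)
{
    float *dIn = nullptr, *dOut = nullptr;
    size_t bytes = img.size() * sizeof(float);
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, img.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    boxFilter3x3<<<grid, block>>>(dIn, dOut, width, height);

    std::vector<float> result(img.size());
    cudaMemcpy(result.data(), dOut, bytes, cudaMemcpyDeviceToHost);  // also waits for the kernel
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Neighbouring threads along x read neighbouring addresses in each row of the window, which is why coalescing seems plausible in a kernel of this shape.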
When I use ordinary global memory, I get a baseline execution time. I guess the memory reads and writes are coalesced, because all threads do the same thing and the control flow never diverges.
Texture memory (2D) is, unfortunately, twice as slow. This is very disappointing, since it is stated to be FASTER in applications like this, where each access depends on a local neighbourhood.
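For comparison, a 2D texture variant of a 3x3 mean filter might look roughly like the sketch below. It is an assumption for illustration only, not the code that was timed: it uses the CUDA texture object API with a `cudaArray`, point filtering, clamped addressing and unnormalized coordinates, and the names `boxFilter3x3Tex` and `runBoxFilterTex` are hypothetical.

```cuda
#include <vector>
#include <cuda_runtime.h>

// 3x3 mean filter fetching every neighbour through a 2D texture object.
// Clamp addressing handles the image borders in hardware.
__global__ void boxFilter3x3Tex(cudaTextureObject_t tex, float* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            sum += tex2D<float>(tex, x + dx + 0.5f, y + dy + 0.5f);  // +0.5 addresses the texel centre
    out[y * width + x] = sum / 9.0f;
}

// Hypothetical host helper; error checking is omitted for brevity.
std::vector<float> runBoxFilterTex(const std::vector<float>& img, int width, int height)
{
    // Copy the image into a cudaArray, the 2D storage a texture is bound to here.
    cudaChannelFormatDesc chan = cudaCreateChannelDesc<float>();
    cudaArray_t arr = nullptr;
    cudaMallocArray(&arr, &chan, width, height);
    size_t rowBytes = width * sizeof(float);
    cudaMemcpy2DToArray(arr, 0, 0, img.data(), rowBytes, rowBytes, height,
                        cudaMemcpyHostToDevice);

    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = arr;

    cudaTextureDesc texDesc = {};
    texDesc.addressMode[0] = cudaAddressModeClamp;
    texDesc.addressMode[1] = cudaAddressModeClamp;
    texDesc.filterMode = cudaFilterModePoint;      // no interpolation
    texDesc.readMode = cudaReadModeElementType;    // return raw float values
    texDesc.normalizedCoords = 0;                  // pixel coordinates

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);

    float* dOut = nullptr;
    cudaMalloc(&dOut, img.size() * sizeof(float));
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    boxFilter3x3Tex<<<grid, block>>>(tex, dOut, width, height);

    std::vector<float> result(img.size());
    cudaMemcpy(result.data(), dOut, img.size() * sizeof(float), cudaMemcpyDeviceToHost);

    cudaDestroyTextureObject(tex);
    cudaFreeArray(arr);
    cudaFree(dOut);
    return result;
}
```

When timing a version like this against the global-memory one, it probably makes sense to time only the kernel launch, so that the extra copy into the `cudaArray` does not get counted against the texture path.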
Is this result plausible, or did I probably make a terrible mistake somewhere in the code?