Texture vs Global Memory Bandwidth

Hello everybody! I have two different kernels that just copy data. The first does not use texture memory; the second does. I assumed texture usage would improve bandwidth since it is cached, but I do not see any difference. Here are the kernels:

Thank you in advance!


extern "C" __global__ void kernel1(float *eli1, float *eli2, float *out, int size)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if (tid < size) out[tid] = eli1[tid];
}


texture<float, 1, cudaReadModeElementType> texref1;

extern "C" __global__ void kernel2(float *out, int size)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if (tid < size) out[tid] = tex1Dfetch(texref1, tid);
}
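For completeness, kernel2 only works if texref1 is bound to the device buffer before the launch. A minimal host-side sketch using the legacy texture reference API of that era (buffer names d_in/d_out, the size N, and the launch configuration are my assumptions, not from the thread):

```cuda
// Hypothetical host-side setup for kernel2. Uses the legacy high-level
// texture reference binding API (cudaBindTexture), which was standard on
// Fermi-era CUDA and later deprecated in favour of texture objects.
float *d_in, *d_out;
int N = 1 << 20;
cudaMalloc(&d_in,  N * sizeof(float));
cudaMalloc(&d_out, N * sizeof(float));

// Bind the linear device buffer to the file-scope texture reference.
cudaBindTexture(0, texref1, d_in, N * sizeof(float));

int threads = 256;
int blocks  = (N + threads - 1) / threads;
kernel2<<<blocks, threads>>>(d_out, N);

cudaUnbindTexture(texref1);
```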



Why would a cache help in that situation? Every thread is reading a different value. If anything, I would expect texture to be slower, because of the very large number of cache misses that code will probably generate.

But the cache is bigger than one value, isn't it? Isn't it 8 KB per TPC? So when I read one value, some neighbouring values should be put into the cache. If that's not so, what exactly happens?

The texture cache is not like CPU L1/L2/… cache.

You might see improvement using textures if the access pattern within the block is random or semi random, for example:

float fCurrent = eli[threadIdx.x];

float fNext    = eli[threadIdx.x + 1];

I think that due to the relaxation of the coalescing rules on current hardware, even such a pattern might not be faster with textures than with gmem reads.
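To make the overlapping-read pattern above concrete, here is a sketch of the same access written as two full kernels, one through gmem and one through the texture path. The kernel names and the neighbour-difference computation are illustrative assumptions on my part, not code from the thread:

```cuda
// Illustrative only: each thread reads its own element plus its neighbour,
// so consecutive threads' loads overlap and the second load is shifted by
// one element. On compute 1.x this kind of misaligned/overlapping read is
// where the texture cache could plausibly help.
texture<float, 1, cudaReadModeElementType> texref1;

__global__ void diff_gmem(const float *eli, float *out, int size)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < size - 1) {
        float fCurrent = eli[tid];
        float fNext    = eli[tid + 1];   // overlapping, shifted read
        out[tid] = fNext - fCurrent;
    }
}

// Same access pattern through the texture cache (texref1 bound to eli).
__global__ void diff_tex(float *out, int size)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < size - 1) {
        float fCurrent = tex1Dfetch(texref1, tid);
        float fNext    = tex1Dfetch(texref1, tid + 1);
        out[tid] = fNext - fCurrent;
    }
}
```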


So on Fermi there would never be an advantage to using texture for just reading elements (I mean no filtering, no wrap modes, and so on)?

This is kind of suggested in Fermi's Tuning Guide:

“On devices of compute capability 1.x, some kernels can achieve a speedup when using (cached) texture fetches rather than regular global memory loads (e.g., when the regular loads do not coalesce well). Unless texture fetches provide other benefits such as address calculations or texture filtering (Section …), this optimization can be counter-productive on devices of compute capability 2.0, however, since global memory loads are cached in L1 and the L1 cache has higher bandwidth than the texture cache.”


But we’ll have to wait and see :)
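In the meantime, the "I do not see any difference" observation can be made quantitative by timing the copy kernels and computing effective bandwidth. A sketch using CUDA events — kernel1 is the one from this thread; the buffer names, size, and launch configuration are my assumptions:

```cuda
// Hypothetical timing scaffold: measure effective bandwidth of kernel1
// with CUDA events. Assumes kernel1 from the thread is in scope.
#include <cstdio>

int N = 1 << 24;
float *d_in, *d_out;
cudaMalloc(&d_in,  N * sizeof(float));
cudaMalloc(&d_out, N * sizeof(float));

int threads = 256;
int blocks  = (N + threads - 1) / threads;

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
kernel1<<<blocks, threads>>>(d_in, NULL, d_out, N);  // eli2 is unused
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);

// The copy reads N floats and writes N floats.
double gbps = 2.0 * N * sizeof(float) / (ms * 1.0e6);
printf("effective bandwidth: %.1f GB/s\n", gbps);
```

Running the same scaffold over kernel2 (after binding texref1 to d_in) gives a like-for-like comparison of the two paths.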