Hello everybody! I have two different kernels that just copy data. The first does not use texture memory; the second does. I assumed that using textures would improve bandwidth because texture reads are cached, but I do not see any difference. Here are the kernels:
Thank you in advance!
[codebox]
// Plain global-memory copy: one element per thread (eli2 is not used here).
extern "C" __global__ void kernel1(float *eli1, float *eli2, float *out, int size)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < size) out[tid] = eli1[tid];
}
[/codebox]
Why would a cache help in that situation? Every thread is reading a different value, and each value is read only once. If anything, I would expect the texture path to be slower, because of the very large number of cache misses that code will probably generate.
But the cache is bigger than one value, isn't it? Is it 8 KB per TPC? So when I read one value, some neighbouring values should be put into the cache as well. If that is not the case, what exactly happens?
I think that, due to the relaxed coalescing rules on current hardware, even such a pattern might not be faster with textures than with plain global memory (gmem) reads.
“On devices of compute capability 1.x, some kernels can achieve a speedup when using (cached) texture fetches rather than regular global memory loads (e.g., when the regular loads do not coalesce well). Unless texture fetches provide other benefits such as address calculations or texture filtering (Section 5.3.2.5), this optimization can be counter-productive on devices of compute capability 2.0, however, since global memory loads are cached in L1 and the L1 cache has higher bandwidth than the texture cache”
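For anyone who wants to measure this rather than guess, here is a minimal sketch that times the same straight copy through a plain global load and through a texture fetch. It is not the original poster's second kernel (that one is not shown above). It assumes the texture object API (`cudaTextureObject_t`, compute capability 3.0 or later) as a stand-in for the older texture reference API, and the names `copyGlobal`, `copyTexture` and `runCopyBenchmark`, the block size of 256 and the problem size are all hypothetical choices made for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Plain global-memory copy, same shape as kernel1 in the question.
__global__ void copyGlobal(const float *in, float *out, int size)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < size) out[tid] = in[tid];
}

// Same copy, but the read goes through a texture object bound to linear memory.
__global__ void copyTexture(cudaTextureObject_t tex, float *out, int size)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < size) out[tid] = tex1Dfetch<float>(tex, tid);
}

// Hypothetical helper: times both kernels once and checks both outputs.
// Returns 0 on success, 1 if a CUDA call failed or a copy is wrong.
int runCopyBenchmark(int size)
{
    const size_t bytes = size_t(size) * sizeof(float);
    std::vector<float> host(size), backGlobal(size), backTex(size);
    for (int i = 0; i < size; ++i) host[i] = float(i % 1024);

    float *in = nullptr, *outGlobal = nullptr, *outTex = nullptr;
    if (cudaMalloc(&in, bytes) != cudaSuccess) return 1;
    if (cudaMalloc(&outGlobal, bytes) != cudaSuccess) return 1;
    if (cudaMalloc(&outTex, bytes) != cudaSuccess) return 1;
    cudaMemcpy(in, host.data(), bytes, cudaMemcpyHostToDevice);

    // Describe the input buffer as a 1D linear texture of floats.
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = in;
    res.res.linear.desc = cudaCreateChannelDesc<float>();
    res.res.linear.sizeInBytes = bytes;
    cudaTextureDesc texDesc = {};
    texDesc.readMode = cudaReadModeElementType;
    cudaTextureObject_t tex = 0;
    if (cudaCreateTextureObject(&tex, &res, &texDesc, nullptr) != cudaSuccess) return 1;

    const int threads = 256;
    const int blocks = (size + threads - 1) / threads;

    // Warm-up launches so the timed runs do not include start-up cost.
    copyGlobal<<<blocks, threads>>>(in, outGlobal, size);
    copyTexture<<<blocks, threads>>>(tex, outTex, size);

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);
    cudaEventRecord(t0);
    copyGlobal<<<blocks, threads>>>(in, outGlobal, size);
    cudaEventRecord(t1);
    copyTexture<<<blocks, threads>>>(tex, outTex, size);
    cudaEventRecord(t2);
    cudaEventSynchronize(t2);

    float msGlobal = 0.0f, msTex = 0.0f;
    cudaEventElapsedTime(&msGlobal, t0, t1);
    cudaEventElapsedTime(&msTex, t1, t2);
    // Each element is read once and written once, hence 2 * bytes of traffic.
    printf("global : %.3f ms, %.1f GB/s\n", msGlobal, 2.0 * bytes / (msGlobal * 1e6));
    printf("texture: %.3f ms, %.1f GB/s\n", msTex, 2.0 * bytes / (msTex * 1e6));

    cudaMemcpy(backGlobal.data(), outGlobal, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(backTex.data(), outTex, bytes, cudaMemcpyDeviceToHost);
    bool ok = (cudaGetLastError() == cudaSuccess);
    for (int i = 0; ok && i < size; ++i)
        ok = (backGlobal[i] == host[i]) && (backTex[i] == host[i]);

    cudaEventDestroy(t0); cudaEventDestroy(t1); cudaEventDestroy(t2);
    cudaDestroyTextureObject(tex);
    cudaFree(in); cudaFree(outGlobal); cudaFree(outTex);
    return ok ? 0 : 1;
}

int main()
{
    return runCopyBenchmark(1 << 22);  // hypothetical size: 4M floats
}
```

A single timed launch per kernel is noisy, so averaging over several runs would give steadier numbers. Since consecutive threads read consecutive addresses and nothing is reused, the two figures would be expected to land close together, which matches what the replies above describe.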