what's the benefit of using texture memory in Fermi versus using global memory

In the Fermi architecture, global memory and local memory are cached, just as texture memory and constant memory are. In the past, I often copied data from global memory to texture memory because the access speed is faster, especially when reads are uncoalesced. Now that global memory is also cached, does that mean the performance of global memory and texture memory will be similar?

Thank you,

I think the only remaining advantages for texture memory are the hardware accelerated normalized coordinates, auto-normalization of texture values, and linear interpolation.

Most kernels I’ve tried have benefited from switching from tex1Dfetch reads to global memory reads. At least one experienced reduced performance with the switch, so apparently some memory access patterns are still better handled by tex1Dfetch.

Oh, and I forgot, textures are still the only way to access cudaArrays, which are useful when you have 2D and 3D spatial locality.

Oh, excellent question. I have never tried Fermi, but I think using textures still has many benefits when accessing data located in a cudaArray (optimized for 2D and 3D).

I would appreciate it if you or somebody around here did a comparison of access speed between texture and global memory.

I think Mr Anderson is correct - it’s probably dependent on the access pattern.

I saw 20-30% degradation going from textures to gmem (i.e. textures were faster), even with L1.

My access pattern is mostly like this:

float fValue = tex1Dfetch( tex1, iIndex ) * w1 + tex1Dfetch( tex1, iIndex + 1 );  // or get a float2 out of the texture...

For me, L1 seems just to make things worse (increasing L1 to 48K and reducing shared memory to 16K also reduced occupancy and resulted in slower performance). I guess it was mostly useful for register spilling, enabling me to use -maxrregcount and thus increase occupancy by moving spilled registers to L1.
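For reference, the 48K L1 / 16K shared split is selected per kernel with the CUDA runtime's cudaFuncSetCacheConfig; a hedged sketch (the kernel name is made up):

```cuda
// Hedged sketch of the L1/shared split mentioned above; kernel name is
// illustrative. On Fermi each SM has 64K split between L1 and shared memory.
__global__ void myKernel(float *data) { /* ... */ }

void configure(void)
{
    // Ask for the 48K L1 / 16K shared split for this kernel:
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
    // ...or the default 16K L1 / 48K shared split:
    // cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
}
```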

eyal

Some more thoughts on this I just remembered:
I asked Michael Garland about textures vs. L1 in his presentation at the VSCSE summer school last week. He confirmed what we are saying here: sometimes L1 is better and sometimes the tex cache is better for the sparse matrix-vector multiply kernels he works on. The interesting thing he added is this: further benefits are possible by making use of both caches in a single kernel. They are independent caches, after all! The idea is to read from one array with tex1Dfetch (or tex2D/3D) and from the others with L1. 1) It limits L1 cache pollution, and 2) it gives you a larger total amount of cache to read from.
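The dual-cache idea might look something like this in Fermi-era CUDA (texture reference API; the names are illustrative, not from the thread):

```cuda
// Hedged sketch of the dual-cache idea above, in Fermi-era CUDA
// (texture reference API). All names are made up for illustration.
texture<float, 1, cudaReadModeElementType> texA;

__global__ void dualCacheMul(const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // a[i] streams through the texture cache...
        float a = tex1Dfetch(texA, i);
        // ...while b[i] is an ordinary global load, cached in L1 on Fermi.
        out[i] = a * b[i];
    }
}

// Host side, before launch: bind the first array to the texture reference.
//   cudaBindTexture(0, texA, d_a, n * sizeof(float));
```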

I’ve only got one kernel that performs cached reads from 2 different arrays on which I can try this idea out - it did lead to a slight performance improvement. The improvement likely wasn’t that great because the 2nd array read is not in the inner loop and is only performed once for every ~30-40 inner-loop random reads.

It is too bad that the tex cache is so shrouded in secrecy that we can’t know what access patterns work well for it. Even a cache line size would be something!
