what's the benefit of using texture memory in Fermi versus using global memory

In the Fermi architecture, global memory and local memory are cached, just as texture memory and constant memory are. In the past, I often copied data from global memory to texture memory because the access speed is faster, especially when reads are uncoalesced. Now that global memory is also cached, does that mean the performance of global memory and texture memory will be similar?

Thank you,

I think the only remaining advantages for texture memory are the hardware accelerated normalized coordinates, auto-normalization of texture values, and linear interpolation.

Most kernels I’ve tried have benefited from switching from tex1Dfetch reads to global memory reads. At least one experienced reduced performance with the switch, so apparently some memory access patterns are still better handled by tex1Dfetch.

Oh, and I forgot, textures are still the only way to access cudaArrays, which are useful when you have 2D and 3D spatial locality.

Oh, excellent question. I have never tried Fermi, but I think using textures still has many benefits when accessing data located in a cudaArray (optimized for 2D and 3D).

I would appreciate it if you or somebody around here did a comparison of access speed between texture and global memory.

I think Mr Anderson is correct - it’s probably dependent on the access pattern.

I saw 20-30% degradation going from textures to gmem (i.e. textures were faster), even with L1.

My access pattern is mostly like this:

float fValue = tex1Dfetch( tex1, iIndex ) * w1 + tex1Dfetch( tex1, iIndex + 1 );  // or get a float2 out of the texture...

For me, L1 seems just to make things worse (increasing L1 to 48K and reducing shared memory to 16K also reduced occupancy and resulted in slower performance). I guess it was mostly useful for register spilling, enabling me to use -maxrregcount and thus increase occupancy by moving spilled registers to L1.
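For reference, the 48K L1 / 16K shared split is selected per kernel with the CUDA runtime's cudaFuncSetCacheConfig; a hedged sketch (the kernel name is made up):

```cuda
// Hedged sketch of the L1/shared split mentioned above; kernel name is
// illustrative. On Fermi each SM has 64K split between L1 and shared memory.
__global__ void myKernel(float *data) { /* ... */ }

void configure(void)
{
    // Ask for the 48K L1 / 16K shared split for this kernel:
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
    // ...or the default 16K L1 / 48K shared split:
    // cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
}
```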

eyal

Some more thoughts on this I just remembered:
I asked Michael Garland about textures vs. L1 in his presentation at the VSCSE summer school last week. He confirmed what we are saying here: sometimes L1 is better and sometimes the tex cache is better for the sparse matrix-vector multiply kernels he works on. The interesting thing he added is this: further benefits are possible by making use of both caches in a single kernel. They are independent caches, after all! The idea is to read from one array with tex1Dfetch (or tex2D/3D) and from the others with L1. 1) It limits L1 cache pollution, and 2) it gives you a larger total amount of cache to read from.
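The dual-cache idea might look something like this in Fermi-era CUDA (texture reference API; the names are illustrative, not from the thread):

```cuda
// Hedged sketch of the dual-cache idea above, in Fermi-era CUDA
// (texture reference API). All names are made up for illustration.
texture<float, 1, cudaReadModeElementType> texA;

__global__ void dualCacheMul(const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // a[i] streams through the texture cache...
        float a = tex1Dfetch(texA, i);
        // ...while b[i] is an ordinary global load, cached in L1 on Fermi.
        out[i] = a * b[i];
    }
}

// Host side, before launch: bind the first array to the texture reference.
//   cudaBindTexture(0, texA, d_a, n * sizeof(float));
```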

I’ve only got one kernel that performs cached reads from 2 different arrays on which I can try this idea out - it did lead to a slight performance improvement. The improvement likely wasn’t that great because the 2nd array read is not in the inner loop and is only performed once for every ~30-40 inner-loop random reads.

It is too bad that the tex cache is so shrouded in secrecy that we can’t know what access patterns work well for it. Even a cache line size would be something!
