CUDA texture memory performance

egg · January 12, 2009, 11:07am

Firstly, I thought texture memory is cached, so why does the CUDA user manual say that shared memory access is faster than texture access?

Secondly, the texture cache is only about 8 KB per MP, so the total texture cache is about 80 KB, but a texture image can be as large as 2^16 * 2^16. My question is: how much space does one texel occupy? The user manual says that a texture fetch can miss the cache, so I wonder whether part of the texture is stored in device memory.

Thirdly, is anybody familiar with the convolution_Texture example in the CUDA SDK? Its performance is much lower than IPP (Intel Integrated Performance Primitives). Can you give me some advice on speeding up the convolution algorithm under CUDA? The separable approach gives a large speedup, but the transfer time is too large: the result of the row pass has to be written to device memory, bound to a texture again, and only then used for the column convolution. The user manual says that texture data must come from device memory. Does anybody have advice on reducing this transfer cost, e.g. with a PBO (I'm not sure whether a PBO can improve data transfer)?

Sarnath · January 12, 2009, 12:07pm

1. A texture is first bound to global memory by the application before launching the CUDA kernel.
2. Textures are always read-only. You canNOT write to texture memory (which is actually bound to global memory).
3. A texture cache is present in every multiprocessor.
4. Textures should be used when the threads of a block access different areas of the bound global memory in a non-orderly fashion.
a) Totally unordered access may or may not help, depending on how the data is cached.
b) However, if the accesses exhibit spatial locality (1D or 2D), then you need to bind your texture in an appropriate way (1D or 2D) to take advantage of it.
c) There are cases where multiple blocks running on a multiprocessor take advantage of data cached by another concurrently executing block on that MP.
However, you should NOT count on this, because block scheduling is non-deterministic.
i) For the same reason, it would make sense to run only 1 block per MP if your kernel is very texture-dependent.
5. Shared memory access is always faster, because there is NO such thing as a shared-memory miss.
a) Note that shared memory (SM) is actually the equivalent of a CPU cache.
b) SM is read/write.
c) SM is a conscious cache, as the program has to do the caching explicitly.
However, CPU caches operate transparently to the application. The app never knows what is cached and what is not. It is sub-conscious.

Hope this is clear.
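
A minimal sketch of what "conscious" caching in shared memory can look like. The 3-point box filter, the names `boxFilterShared` and `runBoxFilterShared`, and the tile size are assumptions chosen for illustration; nothing here comes from the SDK sample. Each block copies its tile plus a one-element halo from global memory into shared memory, and only then do the threads read their neighbours.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 256;  // threads per block (assumed value)
constexpr int RADIUS = 1;  // 3-point box filter (assumed for illustration)

// Must be launched with exactly TILE threads per block.
__global__ void boxFilterShared(const float* in, float* out, int n)
{
    __shared__ float tile[TILE + 2 * RADIUS];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // index in global memory
    int lid = threadIdx.x + RADIUS;                   // index inside the tile

    // Explicit caching: the program decides what is staged in shared memory.
    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x < RADIUS) {
        int left = gid - RADIUS;
        int right = gid + blockDim.x;
        tile[lid - RADIUS] = (left >= 0 && left < n) ? in[left] : 0.0f;
        tile[lid + blockDim.x] = (right < n) ? in[right] : 0.0f;
    }
    __syncthreads();  // the whole tile must be loaded before anyone reads it

    if (gid < n) {
        float sum = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k)
            sum += tile[lid + k];  // served from shared memory, never a "miss"
        out[gid] = sum / (2 * RADIUS + 1);
    }
}

// Host helper (hypothetical name). Error checking is omitted for brevity.
std::vector<float> runBoxFilterShared(const std::vector<float>& h_in)
{
    int n = static_cast<int>(h_in.size());
    std::vector<float> h_out(n);
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int blocks = (n + TILE - 1) / TILE;
    boxFilterShared<<<blocks, TILE>>>(d_in, d_out, n);

    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Elements outside the array are treated as zero, so the first and last outputs are averaged with a zero neighbour. Whether this beats a texture fetch depends on the access pattern and the hardware, so it is worth measuring both.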

egg · January 12, 2009, 2:16pm

Thanks very much for your reply…

But I am still confused about some things. If each multiprocessor has 6-8 KB of texture memory and the GeForce 8800 GTX has 16 MPs, then the total on-chip texture memory is about 96-128 KB. However, a 2D texture image can be as large as 2^16 * 2^16, which is far larger than the on-chip texture memory. My question is whether the excess texture data is stored in global memory, and whether that is what causes texture cache misses.

The second question is whether there is a trick to avoid unnecessary data transfer. When the output of the first kernel is the input of the second kernel, is there a way to avoid writing the first kernel's result to global memory? That is, could the first kernel's result be kept in some cache so that the second kernel uses it directly? As you know, data transfer is time-consuming in CUDA.

Thanks a lot.

_Big_Mac · January 12, 2009, 4:40pm

That’s 6-8KB of texture cache, not texture memory. Just like a CPU’s L2 cache is on the order of 1MB but you can use all your 2GB of RAM. Texture memory shares the GPU’s RAM with global memory (i.e. textures are a special kind of global memory).
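
To make the cache versus memory distinction concrete: 2^16 * 2^16 texels of 4 bytes each would be 2^34 bytes, i.e. 16 GiB, so that figure is an addressing limit and not something that lives on chip. Here is a minimal sketch of a texture whose data sits in device memory and is read through the texture unit. It assumes the texture object API (`cudaCreateTextureObject`), which is newer than this thread, and the names `copyThroughTexture` and `roundTripThroughTexture` are made up for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Every read goes through the texture unit. Which texels happen to be in the
// small on-chip cache is decided by the hardware, not by the kernel.
__global__ void copyThroughTexture(cudaTextureObject_t tex, float* out,
                                   int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = tex2D<float>(tex, x + 0.5f, y + 0.5f);
}

// Host helper (hypothetical name). Error checking is omitted for brevity.
std::vector<float> roundTripThroughTexture(const std::vector<float>& h_in,
                                           int width, int height)
{
    // The texel data itself lives in device memory, here in a cudaArray.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t array = nullptr;
    cudaMallocArray(&array, &desc, width, height);
    cudaMemcpy2DToArray(array, 0, 0, h_in.data(), width * sizeof(float),
                        width * sizeof(float), height, cudaMemcpyHostToDevice);

    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = array;

    cudaTextureDesc td = {};
    td.addressMode[0] = cudaAddressModeClamp;
    td.addressMode[1] = cudaAddressModeClamp;
    td.filterMode = cudaFilterModePoint;      // no interpolation
    td.readMode = cudaReadModeElementType;    // return raw floats
    td.normalizedCoords = 0;                  // address texels by index

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);

    float* d_out = nullptr;
    cudaMalloc(&d_out, size_t(width) * height * sizeof(float));
    dim3 block(16, 16);
    dim3 grid((width + 15) / 16, (height + 15) / 16);
    copyThroughTexture<<<grid, block>>>(tex, d_out, width, height);

    std::vector<float> h_out(size_t(width) * height);
    cudaMemcpy(h_out.data(), d_out, h_out.size() * sizeof(float),
               cudaMemcpyDeviceToHost);

    cudaDestroyTextureObject(tex);
    cudaFreeArray(array);
    cudaFree(d_out);
    return h_out;
}
```

Note that the kernel contains no cache management at all. A fetch that misses the cache is simply served from device memory at higher latency.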

For your second question - no, there’s no way to do that. Subsequent kernels can only share in/out data through Global memory, all caches (and shared memory) should be considered automatically emptied when the kernel ends. You might consider merging the two kernels if the algorithm allows.
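
A rough sketch of what sharing data through global memory looks like for a separable filter, assuming a simple 3-tap box filter with clamped borders (the kernel names and the helper `separableBox` are hypothetical). The intermediate buffer `d_tmp` never leaves the GPU: the only host transfers are the initial upload and the final download.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Pass 1: 3-tap horizontal average, borders clamped to the edge pixel.
__global__ void rowPass(const float* in, float* tmp, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    int xl = max(x - 1, 0);
    int xr = min(x + 1, w - 1);
    tmp[y * w + x] = (in[y * w + xl] + in[y * w + x] + in[y * w + xr]) / 3.0f;
}

// Pass 2: 3-tap vertical average over the result of pass 1.
__global__ void colPass(const float* tmp, float* out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    int yu = max(y - 1, 0);
    int yd = min(y + 1, h - 1);
    out[y * w + x] = (tmp[yu * w + x] + tmp[y * w + x] + tmp[yd * w + x]) / 3.0f;
}

// Host helper (hypothetical name). Error checking is omitted for brevity.
std::vector<float> separableBox(const std::vector<float>& h_in, int w, int h)
{
    size_t bytes = size_t(w) * h * sizeof(float);
    float* d_in = nullptr;
    float* d_tmp = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_tmp, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((w + 15) / 16, (h + 15) / 16);
    rowPass<<<grid, block>>>(d_in, d_tmp, w, h);
    // No copy back to the host here. d_tmp stays in device global memory, and
    // kernels launched on the same stream run in order, so colPass sees the
    // complete output of rowPass.
    colPass<<<grid, block>>>(d_tmp, d_out, w, h);

    std::vector<float> h_out(size_t(w) * h);
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_tmp);
    cudaFree(d_out);
    return h_out;
}
```

Merging the two passes into one kernel, as suggested above, would mean staging the row results in shared memory instead of `d_tmp`. That only works if each block can compute all the row results its column pass needs.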

Sarnath · January 13, 2009, 4:51am

Egg,

You have to note that

Textures are READ-ONLY. And as BigMac says, it is a cache. That's all. It operates in a sub-conscious fashion, i.e. caching of texture accesses occurs transparently to the application. Your program cannot dictate what is cached and what is not.

You cannot WRITE into them.
