CUDA texture memory performance

First, as I understand it, texture memory is cached. So why does the CUDA user manual say that shared memory is faster than texture memory?

Second, the texture cache is only about 8 KB per multiprocessor (MP), so the total texture cache is about 80 KB, yet a texture image can be as large as 2^16 x 2^16. How much space does one texel occupy? The user manual says that a texture fetch can miss the cache, so I assume part of the texture is stored in device memory. Is that right?

Third, is anybody familiar with the convolution_Texture example in the CUDA SDK? Its performance is much lower than IPP (Intel Integrated Performance Primitives). Can anyone give me some advice on speeding up the convolution algorithm under CUDA? The separable approach gives a large speedup, but the transfer time is too high: the result of the row pass has to be written to device memory, delivered to the texture again, and only then can the column convolution be calculated. The user manual says that texture data must come from device memory. Does anybody have advice for reducing this transfer cost, e.g. with a PBO? (I’m not sure whether a PBO would improve the data transfer.)

  1. A texture is first bound to global memory by the application before the CUDA kernel is launched.

  2. Textures are always read-only. You CANNOT write to texture memory (which is actually bound to global memory).

  3. Texture cache is present in every multi-processor.

  4. Textures should be used when the threads of a block access different areas of the bound global memory in a non-orderly fashion.
    a) Totally unordered access may or may not help, depending on how the data is cached.
    b) However, if the accesses exhibit spatial locality (1D or 2D), then you need to bind your texture in the appropriate way (1D or 2D) to take advantage of it.
    c) There are cases where multiple blocks running on one multiprocessor benefit from data cached by another block executing concurrently on that MP.
    However, this case should NOT be counted on, because block scheduling is non-deterministic.
    i) For the same reason, it would make sense to run only 1 block per MP if your kernel depends heavily on textures.

  5. Shared memory access is always faster, because there is NO such thing as a shared-memory miss.
    a) Note that shared memory is actually the equivalent of a CPU cache.
    b) SM is read/write.
    c) SM is a conscious cache, as the program has to do the caching explicitly.
    CPU caches, however, operate transparently to the application. The app never knows what is cached and what is not. It is sub-conscious.
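
To make point 5 concrete, here is a minimal sketch of shared memory used as a program-managed ("conscious") cache. The names `smooth3`, `smooth3_host` and `TILE` are made up for illustration, the 3-point average is just a stand-in workload, and the kernel assumes it is launched with exactly `TILE` threads per block. Error checking is left out to keep it short.

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define TILE 256  // threads per block (hypothetical choice)

// 3-point moving average. Each block first copies its tile, plus one halo
// element on each side, from global memory into shared memory. After that,
// every neighbour read is served from shared memory.
__global__ void smooth3(const float* in, float* out, int n)
{
    __shared__ float tile[TILE + 2];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;  // slots 0 and TILE + 1 hold the halo

    // Explicit "cache fill". Elements outside the array count as 0.
    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x == 0)
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;

    __syncthreads();  // the tile must be complete before anyone reads it

    if (gid < n)
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}

// Host helper: upload, run the kernel, download.
void smooth3_host(const float* h_in, float* h_out, int n)
{
    assert(n > 0);
    size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    smooth3<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n);

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}
```

Nothing is cached here unless the kernel copies it into `tile` itself, which is the difference from the texture cache: there the hardware decides what stays cached.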

Hope this is clear.

Thanks very much for your reply…

But I am still confused about a few things. If each multiprocessor has 6-8 KB of texture memory and the GeForce 8800 GTX has 16 MPs, then the total on-chip texture memory is about 96-128 KB. However, a 2D texture image can be as large as 2^16 x 2^16, which is much larger than the on-chip texture memory. My question is whether the excess texture data is stored in global memory, and whether that is what leads to texture cache misses.

The second question is whether there is a better trick for avoiding unnecessary data transfer. When the output of the first kernel is the input of the second kernel, is there a way to avoid writing the first kernel's result to global memory? That is to say, could the first kernel's result go into some cache that the second kernel then uses directly? Data transfer is time-consuming in CUDA.

Thanks a lot.

That’s 6-8 KB of texture cache, not texture memory. It's just like a CPU, where the L2 cache is on the order of 1 MB but you can still use all 2 GB of your RAM. Texture memory shares the GPU’s RAM with global memory (i.e. textures are a special kind of global memory).
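
To put rough numbers on the cache versus memory distinction, here is a small host-side sketch. The helper names `texture_bytes` and `texels_in_cache` are hypothetical, and the texel types are only examples. The size of one texel is simply the size of its element type.

```cuda
#include <cuda_runtime.h>  // for the built-in vector types such as float4

// Bytes needed in device memory for a width x height texture of texel type T.
template <typename T>
unsigned long long texture_bytes(unsigned long long width,
                                 unsigned long long height)
{
    return width * height * sizeof(T);
}

// Number of texels of type T that fit into a cache of the given size.
template <typename T>
unsigned long long texels_in_cache(unsigned long long cache_bytes)
{
    return cache_bytes / sizeof(T);
}
```

For example, `texture_bytes<float>(65536, 65536)` is 2^34 bytes (16 GiB), which would not even fit in the 768 MB of device memory on an 8800 GTX, while `texels_in_cache<float>(8 * 1024)` is only 2048 texels. So 2^16 is a limit on the dimensions, the texture data itself always sits in device memory, and the cache only ever holds a small window of it.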

For your second question: no, there’s no way to do that. Subsequent kernels can only share input/output data through global memory; all caches (and shared memory) should be considered automatically emptied when a kernel ends. You might consider merging the two kernels if the algorithm allows.
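
On the separable convolution question from earlier in the thread, it may help to see that "through global memory" does not mean "through the host". Below is a minimal sketch, not the SDK sample: the row pass writes into a device buffer and the column pass reads from that same buffer, so the only host transfers are one upload and one download. All names (`conv_rows`, `conv_cols`, `separable_conv`, `c_taps`, `RADIUS`) are hypothetical, it uses plain global memory loads instead of textures, and error checking is omitted.

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define RADIUS 1  // hypothetical filter radius (3 taps)

__constant__ float c_taps[2 * RADIUS + 1];

// Horizontal pass: weighted sum over the row neighbours of each pixel.
__global__ void conv_rows(const float* src, float* dst, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float sum = 0.0f;
    for (int k = -RADIUS; k <= RADIUS; ++k) {
        int xx = min(max(x + k, 0), w - 1);  // clamp at the image border
        sum += c_taps[k + RADIUS] * src[y * w + xx];
    }
    dst[y * w + x] = sum;
}

// Vertical pass: same filter applied along the columns.
__global__ void conv_cols(const float* src, float* dst, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float sum = 0.0f;
    for (int k = -RADIUS; k <= RADIUS; ++k) {
        int yy = min(max(y + k, 0), h - 1);  // clamp at the image border
        sum += c_taps[k + RADIUS] * src[yy * w + x];
    }
    dst[y * w + x] = sum;
}

void separable_conv(const float* h_src, float* h_dst,
                    const float* h_taps, int w, int h)
{
    assert(w > 0 && h > 0);
    size_t bytes = (size_t)w * h * sizeof(float);
    float *d_src = nullptr, *d_tmp = nullptr, *d_dst = nullptr;
    cudaMalloc(&d_src, bytes);
    cudaMalloc(&d_tmp, bytes);  // intermediate result, never leaves the GPU
    cudaMalloc(&d_dst, bytes);

    cudaMemcpyToSymbol(c_taps, h_taps, (2 * RADIUS + 1) * sizeof(float));
    cudaMemcpy(d_src, h_src, bytes, cudaMemcpyHostToDevice);  // only upload

    dim3 block(16, 16);
    dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);
    conv_rows<<<grid, block>>>(d_src, d_tmp, w, h);
    // Same (default) stream, so this starts after conv_rows has finished.
    conv_cols<<<grid, block>>>(d_tmp, d_dst, w, h);

    cudaMemcpy(h_dst, d_dst, bytes, cudaMemcpyDeviceToHost);  // only download
    cudaFree(d_src);
    cudaFree(d_tmp);
    cudaFree(d_dst);
}
```

The write to `d_tmp` and the read back from it still cost device memory bandwidth, which is why merging the two passes into one kernel can be worth trying, but that traffic is far cheaper than a round trip over the bus to the host.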

Egg,

You have to note that

  1. Textures are READ-ONLY. And as BigMac says, it is a cache. That's all. It operates in a sub-conscious fashion, i.e. caching of texture accesses occurs transparently to the application. Your program cannot dictate what is cached and what is not.

You cannot WRITE into them.
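
A minimal sketch of what that looks like in code: the kernel reads its input through a texture and writes its result to a separate global buffer. Note the assumption here: this uses the texture object API (`cudaCreateTextureObject`, `tex1Dfetch`) from newer CUDA toolkits, not the texture reference API that was current when this thread was written. The names `scale_from_texture` and `scale_host` are made up, and error checking is omitted.

```cuda
#include <cassert>
#include <cstring>
#include <cuda_runtime.h>

// Reads go through the texture path (cached transparently by the hardware).
// Writes go to an ordinary global memory buffer, never to the texture.
__global__ void scale_from_texture(cudaTextureObject_t tex, float* out,
                                   int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = factor * tex1Dfetch<float>(tex, i);
}

void scale_host(const float* h_in, float* h_out, int n, float factor)
{
    assert(n > 0);
    size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    // Describe the existing global memory buffer as a 1D linear texture.
    cudaResourceDesc res;
    memset(&res, 0, sizeof(res));
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = d_in;
    res.res.linear.desc = cudaCreateChannelDesc<float>();
    res.res.linear.sizeInBytes = bytes;

    cudaTextureDesc td;
    memset(&td, 0, sizeof(td));
    td.readMode = cudaReadModeElementType;  // return the raw float values

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);

    scale_from_texture<<<(n + 255) / 256, 256>>>(tex, d_out, n, factor);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    cudaDestroyTextureObject(tex);
    cudaFree(d_in);
    cudaFree(d_out);
}
```

The texture here is just a view onto `d_in`, which lives in global memory. There is no call anywhere that controls what the texture cache holds.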