ldg versus textures

I am in the process of porting my code from a C1060 to a K20c.

My main kernel used textures to speed up memory accesses on the C1060.

I have been adapting it for the K20c and have tried 3 approaches.

  1. use textures as on the C1060
  2. use __ldg() for loads, with pointers declared const __restrict__
  3. use const __restrict__ pointers but not __ldg()

I find that 1 is faster than 2, which in turn is faster than 3.

Is this to be expected? The difference in speed between 1 and 2 is substantial, about 25%.

From what I have seen, __ldg is faster. Note, however, that option #3 (const restrict) is only
a hint to the compiler, so if you want to be sure that LDG is used you must call __ldg()
explicitly.


The LDG instruction is a global memory load that uses the texture path. It has the advantage that it does not require the explicit use of textures. Explicit use of textures causes a certain amount of code clutter and overhead (e.g. for API calls to bind textures), and textures are unfamiliar objects to many programmers new to CUDA. The introduction of LDG therefore improves ease of use.

Whether the use of LDG results in higher or lower performance compared to the use of classical textures depends on the use case; I am not aware of any hard-and-fast rule about that. When comparing the performance of LDG with that of regular global loads, I have found that LDG results in higher performance in most cases, but I recall at least one real-life use case where it did not.

As eyalhir74 points out, declaring pointers as “const restrict” facilitates, but does not guarantee, the generation of LDG instructions on Kepler-class GPUs. The reason for this is that the use of “const restrict” pointers makes assertions about local read-only behavior, whereas use of LDG requires the data to be read-only for the lifetime of a kernel. Only the use of the __ldg() device function ensures that LDG is generated.
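To make the distinction concrete, here is a minimal sketch of options 2 and 3 side by side. The kernel names, the scaling operation, and the host helper compare_restrict_and_ldg are all made up for illustration; they are not taken from the original poster's code. __ldg() needs compute capability 3.5 or higher, so this assumes a build for such a target (e.g. -arch=sm_35 for a K20c on a toolkit that still supports it).

```cuda
#include <cuda_runtime.h>
#include <vector>

// Option 3: const + __restrict__ only. The compiler MAY turn the load of
// in[i] into LDG if it can prove the data is read-only for the whole kernel.
__global__ void scale_restrict(const float* __restrict__ in,
                               float* __restrict__ out, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = s * in[i];
}

// Option 2: the explicit __ldg() intrinsic requests the read-only
// (texture path) load regardless of what the compiler can prove.
__global__ void scale_ldg(const float* __restrict__ in,
                          float* __restrict__ out, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = s * __ldg(&in[i]);
}

// Hypothetical host helper: runs both kernels on the same input and
// returns the number of mismatching elements, or -1 on a CUDA error.
int compare_restrict_and_ldg(int n)
{
    std::vector<float> h_in(n), h_a(n), h_b(n);
    for (int i = 0; i < n; ++i) h_in[i] = 0.5f * i;

    size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_a = nullptr, *d_b = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess ||
        cudaMalloc(&d_a, bytes) != cudaSuccess ||
        cudaMalloc(&d_b, bytes) != cudaSuccess) {
        cudaFree(d_in); cudaFree(d_a); cudaFree(d_b);
        return -1;
    }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256, blocks = (n + threads - 1) / threads;
    scale_restrict<<<blocks, threads>>>(d_in, d_a, n, 2.0f);
    scale_ldg<<<blocks, threads>>>(d_in, d_b, n, 2.0f);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    cudaMemcpy(h_a.data(), d_a, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_b.data(), d_b, bytes, cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_a[i] != h_b[i] || h_a[i] != 2.0f * h_in[i]) ++mismatches;

    cudaFree(d_in); cudaFree(d_a); cudaFree(d_b);
    return ok ? mismatches : -1;
}
```

Both kernels compute the same result; the only difference is how the load is requested. To see which load instruction the compiler actually emitted for scale_restrict, one option is to inspect the generated machine code with cuobjdump -sass.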


So it would seem that there is little point in changing code from explicit textures to ldg, since the gains
are in ease of programming rather than speed. And if one has already done the programming, that is no benefit.

In 3.5 there are new texture objects. Will these run at the same speed as the old textures?

With regard to existing code that already uses textures explicitly, size restrictions on the textures could be a reason to look into alternative approaches on Kepler. Use cases where multiple textures are needed to map a single chunk of data can be quite cumbersome, in particular if more than two textures are required to cover the data. I do not have experience with texture objects.
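For readers who have not seen texture objects, the following is a minimal sketch of a fetch from linear memory through one. The kernel copy_via_texobj and the helper texobj_roundtrip are hypothetical names chosen for this example, and the sketch says nothing about how texture objects perform relative to the older bound textures.

```cuda
#include <cuda_runtime.h>
#include <cstring>
#include <vector>

// The texture object is passed as an ordinary kernel argument;
// there is no file-scope texture declaration and no bind call.
__global__ void copy_via_texobj(cudaTextureObject_t tex, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tex1Dfetch<float>(tex, i);
}

// Hypothetical helper: returns the number of mismatching elements,
// or -1 on a CUDA error.
int texobj_roundtrip(int n)
{
    std::vector<float> h_in(n), h_out(n, -1.0f);
    for (int i = 0; i < n; ++i) h_in[i] = 0.25f * i;

    size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return -1; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    // Describe the linear device buffer that the texture object reads from.
    cudaResourceDesc res;
    memset(&res, 0, sizeof(res));
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = d_in;
    res.res.linear.desc = cudaCreateChannelDesc<float>();
    res.res.linear.sizeInBytes = bytes;

    // Return raw element values, no normalization or filtering.
    cudaTextureDesc td;
    memset(&td, 0, sizeof(td));
    td.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    if (cudaCreateTextureObject(&tex, &res, &td, nullptr) != cudaSuccess) {
        cudaFree(d_in); cudaFree(d_out);
        return -1;
    }

    int threads = 256, blocks = (n + threads - 1) / threads;
    copy_via_texobj<<<blocks, threads>>>(tex, d_out, n);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != h_in[i]) ++mismatches;

    cudaDestroyTextureObject(tex);
    cudaFree(d_in); cudaFree(d_out);
    return ok ? mismatches : -1;
}
```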

I don’t think this is the right conclusion :)

Changing your code to use __ldg is very straightforward, so there is no real reason not to test it. Just replace the tex2D call (or whatever you use) in the kernel with a __ldg call and see what performance you get.
You can put the load in a macro to switch easily between ldg, texture and regular global memory access and see which yields the best performance.
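One possible shape for such a macro is sketched below. Everything named here (LOAD_MODE, LOAD, the sum3 kernel, the time_sum3 helper) is hypothetical, and the texture variant assumes a texture object over linear memory rather than whatever texture setup the existing code has. The mode is chosen at compile time, for example with nvcc -DLOAD_MODE=0.

```cuda
#include <cuda_runtime.h>
#include <cstring>
#include <vector>

#ifndef LOAD_MODE
#define LOAD_MODE 1   // 0 = plain global load, 1 = __ldg(), 2 = texture fetch
#endif

#if LOAD_MODE == 2
#define LOAD(ptr, tex, i) tex1Dfetch<float>((tex), (i))
#elif LOAD_MODE == 1
#define LOAD(ptr, tex, i) __ldg(&(ptr)[(i)])
#else
#define LOAD(ptr, tex, i) ((ptr)[(i)])
#endif

// 3-point sum: neighbouring threads re-read the same input elements.
__global__ void sum3(const float* in, cudaTextureObject_t tex, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = LOAD(in, tex, i - 1) + LOAD(in, tex, i) + LOAD(in, tex, i + 1);
}

// Hypothetical helper: returns the time of one kernel launch in ms,
// or -1.0f on a CUDA error or a wrong result.
float time_sum3(int n)
{
    std::vector<float> h_in(n, 1.0f), h_out(n, 0.0f);
    size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return -1.0f; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, bytes);

    // Texture object over the same buffer (only read when LOAD_MODE == 2).
    cudaResourceDesc res;
    memset(&res, 0, sizeof(res));
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = d_in;
    res.res.linear.desc = cudaCreateChannelDesc<float>();
    res.res.linear.sizeInBytes = bytes;
    cudaTextureDesc td;
    memset(&td, 0, sizeof(td));
    td.readMode = cudaReadModeElementType;
    cudaTextureObject_t tex = 0;
    if (cudaCreateTextureObject(&tex, &res, &td, nullptr) != cudaSuccess) {
        cudaFree(d_in); cudaFree(d_out);
        return -1.0f;
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    int threads = 256, blocks = (n + threads - 1) / threads;
    cudaEventRecord(start);
    sum3<<<blocks, threads>>>(d_in, tex, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = -1.0f;
    if (cudaGetLastError() == cudaSuccess)
        cudaEventElapsedTime(&ms, start, stop);

    // Every interior element should be 1 + 1 + 1.
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    for (int i = 1; i < n - 1; ++i)
        if (h_out[i] != 3.0f) ms = -1.0f;

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaDestroyTextureObject(tex);
    cudaFree(d_in); cudaFree(d_out);
    return ms;
}
```

A single timed launch like this is noisy, so for a real comparison it is worth adding a warm-up launch and averaging over many repetitions, and of course timing the actual kernel rather than this toy one.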