Tuning a kernel with LDG(ON/OFF,array) and prefetching

I’ve been tuning some kernels that use LDG and, in theory, would benefit from prefetching.

Here is the executive summary for Maxwell v1:

  • LDG loads typically have either a small impact or none at all.
  • Prefetching has no noticeable impact.

There you have it. You can stop reading here.

LDG might be worth using in your kernels but prefetching seems unnecessary.

With prefetch enabled, perusing the SASS reveals a “CCTL.E.PF2” op for the L2 prefetch and a “CCTL.E.PF1” op for the L1. I wasn’t able to spot the prefetch-uniform equivalent, but it doesn’t appear to be implemented with CCTL.

I’d definitely like to hear if anyone has achieved a performance win using LDG or prefetch on Maxwell v1/v2.

I’ll dump some code here:


#if __CUDA_ARCH__ >= 320
#define PXL_LDG(g)       __ldg(&(g))
#else
#define PXL_LDG(g)       (g)
#endif

#if __CUDA_ARCH__ >= 320
#define PXL_LDG_IF(p,g)  ((p) ? __ldg(&(g)) : (g))
#else
#define PXL_LDG_IF(p,g)  (g)
#endif

With the PXL_LDG_IF() macro you can declare a set of flags at the top of your kernel to toggle LDG access to particular arrays and see whether there is any performance improvement:

#define PXL_LDG_FLAG_FOO         false
#define PXL_LDG_FLAG_BAR         true
#define PXL_LDG_FLAG_BAZ         false
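A minimal sketch of how the flags might be wired into a kernel body. The function `sum_foo_bar` and the array names are hypothetical; the host fallback branch of the macro is shown so the snippet compiles even without a GPU toolchain:

```cpp
// When __CUDA_ARCH__ is not defined (host pass, or host-only build),
// the macro degrades to a plain array access, so the toggles are
// harmless anywhere the code compiles.
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 320
#define PXL_LDG_IF(p,g)  ((p) ? __ldg(&(g)) : (g))
#else
#define PXL_LDG_IF(p,g)  (g)
#endif

#define PXL_LDG_FLAG_FOO false
#define PXL_LDG_FLAG_BAR true

// Hypothetical kernel body: each array's load path is controlled
// independently by its own compile-time flag.
float sum_foo_bar(const float* foo, const float* bar, int i)
{
  return PXL_LDG_IF(PXL_LDG_FLAG_FOO, foo[i]) +
         PXL_LDG_IF(PXL_LDG_FLAG_BAR, bar[i]);
}
```

Flipping a single flag and diffing the SASS then tells you exactly which array's load path changed.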


#if (defined(_MSC_VER) && defined(_WIN64)) || defined(__LP64__)
#define PXL_GLOBAL_PTR   "l"
#else
#define PXL_GLOBAL_PTR   "r"
#endif

__device__ static __forceinline__
void __prefetch_global_l1(const void* const ptr)
{
  asm("prefetch.global.L1 [%0];" : : PXL_GLOBAL_PTR(ptr));
}

__device__ static __forceinline__
void __prefetch_global_uniform(const void* const ptr)
{
  asm("prefetchu.L1 [%0];" : : PXL_GLOBAL_PTR(ptr));
}

__device__ static __forceinline__
void __prefetch_global_l2(const void* const ptr)
{
  asm("prefetch.global.L2 [%0];" : : PXL_GLOBAL_PTR(ptr));
}
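For completeness, here is one way these helpers might be used: issue the L2 prefetch a fixed distance ahead of the current load inside a loop. The distance of 8 iterations and the function shape are assumptions, not tuned values, and a no-op host fallback is included so the sketch compiles without a GPU (a prefetch is only a hint, so a no-op is always safe):

```cpp
#include <cstddef>

#ifdef __CUDA_ARCH__
__device__ static __forceinline__
void __prefetch_global_l2(const void* const ptr)
{
  asm("prefetch.global.L2 [%0];" : : "l"(ptr));
}
#else
// Host fallback: prefetch is purely a performance hint.
static inline void __prefetch_global_l2(const void*) {}
#endif

// Hypothetical: accumulate a strided array, prefetching DIST
// iterations ahead of the element currently being loaded.
float sum_strided(const float* in, size_t n, size_t stride)
{
  const size_t DIST = 8; // assumed prefetch distance, in iterations
  float acc = 0.0f;
  for (size_t i = 0; i < n; i += stride)
  {
    if (i + DIST * stride < n)
      __prefetch_global_l2(in + i + DIST * stride);
    acc += in[i];
  }
  return acc;
}
```

The prefetch distance is exactly the kind of parameter that, as discussed below, interacts badly with cache sizes and microarchitecture changes, so it would need re-tuning per device.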

Software prefetch is often useless, on almost any platform. The reasons are usually related to complicated interactions between prefetch distance and cache / prefetch buffer sizes, and to how these change with every modification of the microarchitecture. Back in the day I was one of the people who lobbied for hardware prefetch in AMD Athlon processors after recognizing these drawbacks. As far as I know, all modern x86 processors include hardware prefetch, and they are even able to follow various access patterns beyond contiguous incrementing (e.g. negative increments, strided accesses).

As for NVIDIA GPUs specifically, I have never seen any benefit from using software prefetch. That does not mean there could not be an instance where someone used it successfully, in which case I’d be interested in hearing the details.

As for LDG, its use results in a performance improvement in most cases compared with the regular load LD.E. However, the compiler will try its level best to generate LDG automatically, so when you do a performance comparison with hand-coded __ldg() you will want to check the SASS to see whether the default compilation already resulted in the use of LDG, in which case one would obviously not expect additional performance improvement from the __ldg() intrinsic.

Lastly, cache structures and load paths were re-configured for Maxwell (a lot of unification of previously separate structures, although I am not knowledgeable about the details), so it seems possible that the use of LDG no longer provides much of a specific performance advantage. But I haven’t done any experiments to examine that trade-off on Maxwell, so this is speculation.

While working on a recent project I found a modest performance benefit when I switched from using __ldg() explicitly to decorating the pointers with the ‘restrict’ and ‘const restrict’ qualifiers. This was on Maxwell sm_52 devices.

I have not yet looked at the SASS output for each version, but I hope to get to that today.

Have you compared using ‘const restrict’ to __ldg() and found any interesting differences?

Unfortunately, it has been my experience that decorating pointers with ‘const restrict’ rarely gives me what I want.

I’ve since moved entirely to explicit, toggleable LDG() ops so I can confidently determine whether there is any impact.

Note that use of restrict often allows more aggressive scheduling of loads (in particular moving a load forward across a store), independent of any effect it has on the generation of LDG instructions. Assuming there is no adverse performance effect caused by higher register pressure, such improved load scheduling can well provide a performance benefit by increasing the latency tolerance of the code.
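As an illustration of that point (a generic sketch, not Maxwell-specific and not code from this thread): in the first function below the compiler must assume the store through `out` could alias `*scale`, forcing a reload every iteration, while the `__restrict__` version is free to hoist the load and to move loads of `in[]` ahead of stores to `out[]`:

```cpp
// Without restrict: the store to out[i] might alias *scale, so a
// conforming compiler must re-read *scale on every iteration.
void scale_aliased(float* out, const float* in,
                   const float* scale, int n)
{
  for (int i = 0; i < n; ++i)
    out[i] = in[i] * *scale;
}

// With __restrict__: the pointers are asserted not to alias, so the
// compiler may keep *scale in a register and schedule the in[] loads
// ahead of the out[] stores.
void scale_restrict(float* __restrict__ out,
                    const float* __restrict__ in,
                    const float* __restrict__ scale, int n)
{
  for (int i = 0; i < n; ++i)
    out[i] = in[i] * *scale;
}
```

Both functions compute the same result when the pointers genuinely don’t alias; the difference shows up only in the generated load/store schedule.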

Good point.

I definitely wasn’t stating that restrict has no impact, just that I’ve found “const restrict” doesn’t guarantee LDG() loads will be generated (as we’ve discussed elsewhere in this forum).

I will re-iterate what I have stated previously: The use of “const restrict” is not a hint to the compiler to generate LDG. It merely supplies additional information to the compiler that can be used for various optimization purposes; as far as LDG generation is concerned, this information is “necessary but may not be sufficient”.

Restricted pointers were added to C (still absent from the C++ standard for reasons unknown) to eliminate the performance disparity between Fortran code and equivalent C code. Fortran assumes a lack of aliasing by design, allowing much more aggressive handling of loads, giving it an inherent performance advantage. The addition of restricted pointers in C99 gave C programmers a means of asserting absence of aliasing to the compiler (in addition most C/C++ compilers, including nvcc I believe, have a command line switch that asserts non-aliasing across the entire compilation unit).
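A tiny example of why the compiler must be conservative in the absence of ‘restrict’ (a generic illustration, not from the thread): if the pointers alias, the “obvious” optimization of reading the source once would change the result.

```cpp
// Adds *src to dst[0] and dst[1]. If src points into dst, the second
// addition observes the first addition's store; caching *src in a
// register would change the answer, so without restrict the compiler
// must reload it between the two statements.
void add_twice(int* dst, const int* src)
{
  dst[0] += *src;
  dst[1] += *src;
}
```

Declaring the parameters with restrict asserts to the compiler that such aliasing never happens in any legal call, which is exactly the guarantee Fortran gets by design.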

I will now and forever use the phrase “determines eligibility” instead of “hint” when discussing “const restrict”.


I like that :-) Sorry to be such a stickler for precise language here (and thus possibly a pain in the neck), but with regard to this particular issue I feel it is important to set correct expectations by clearly pointing out what “const restrict” does and does not do. I have had to deal with way too many disappointed “but where is the promised LDG?” messages.

It was NVIDIA-circa-2012’s fault for making people think it would always work. :)

The first slide gets it right, the second implies it’s guaranteed:

This was way back in the Kepler days and now it’s much more carefully described:

I know. Creating crisp statements for slides can easily conflict with attention to technical nuances, and by the time anybody notices that incorrect expectations were raised in programmers’ minds it is super difficult to clarify such information. Another such case was the topic of warp-synchronous programming.

Personally I have always tried hard to stick to the conservative side of such messaging, trying to ensure that technical details do not get lost and there is no implicit over-promising. But then I never had to make slides or give presentations …

Has anyone seen any performance improvement from the prefetch instruction?

I’ve tried prefetch on my device, but it turned out to hurt performance, due to the extra IMAD and CCTL instructions.

I have gotten 10%–30% improvements in many instances with very-low-occupancy kernels where long-scoreboard stalls are very high.
