What is the benefit from LDG or LDG128?

202476410arsmart · December 30, 2023, 3:08am

I noticed this:

If you are developing for a compute capability 3.5 device you may also want to investigate the LDG instruction which performs read-only global access through the texture cache. The texture cache can have better performance for highly divergent memory accesses and if the application is heavily accessing shared, local, or global memory.

https://forums.developer.nvidia.com/t/switch-off-l1-cache/37274/3

I am wondering where I can find a detailed description of the benefits of LDG in official documentation. Thank you!!!

Greg · January 3, 2024, 1:35am

Compute Capability 3.5 (Kepler) GPUs had separate L1/SHM and Texture Caches. The LDG instruction enabled loading read-only data through the texture cache instead of the L1 cache. This increased the effective “L1” cache space and throughput compared with using only the L1/SHM cache. The LD (load generic) instruction was used on Kepler to read through the L1 cache.

Kepler LDG information can be found in the Kepler Tuning Guide under section L1 Cache and Read-only Data Cache. There were also numerous GTC presentations on how to use the read-only cache (LDG instruction).
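
As a rough illustration of the read-only path described above, here is a minimal sketch. The kernel and helper names (`scale_ldg`, `scale_ldg_vec4`, `run_scale_ldg`) are made up for this example. It uses the two routes documented for CUDA C++: marking pointers `const __restrict__` so the compiler may choose a read-only load on its own, and calling the `__ldg()` intrinsic (compute capability 3.5 and later) to request one explicitly. The `float4` variant is included because a 16-byte `__ldg()` is the usual way to get a 128-bit load; the exact SASS mnemonic depends on the architecture and compiler version, so check it with `cuobjdump -sass` rather than assuming.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Scalar read-only load. `in` must not be written by anyone during the kernel.
__global__ void scale_ldg(const float* __restrict__ in, float* __restrict__ out,
                          float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = factor * __ldg(&in[i]);   // explicit read-only data load
    }
}

// 128-bit variant: one 16-byte load per thread. Pointers must be 16-byte aligned
// (cudaMalloc allocations are). n4 is the element count in float4 units.
__global__ void scale_ldg_vec4(const float4* __restrict__ in, float4* __restrict__ out,
                               float factor, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 v = __ldg(&in[i]);
        v.x *= factor; v.y *= factor; v.z *= factor; v.w *= factor;
        out[i] = v;
    }
}

// Hypothetical host helper: fills the input with 1.0f, runs the scalar kernel,
// and returns the sum of the outputs. Returns -1.0f if allocation fails.
float run_scale_ldg(int n, float factor)
{
    std::vector<float> host(n, 1.0f);
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return -1.0f; }
    cudaMemcpy(d_in, host.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale_ldg<<<blocks, threads>>>(d_in, d_out, factor, n);

    cudaMemcpy(host.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    float sum = 0.0f;
    for (float v : host) sum += v;
    return sum;
}
```

Calling `run_scale_ldg(1024, 2.0f)` from `main` is enough to exercise the scalar kernel. Whether `__ldg()` actually changes performance depends on the architecture and the access pattern, so it is worth profiling both versions.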

Kepler L1 and Texture Cache

On an L1 cached load, a miss to a sector is promoted to a cache line miss (4 x 32B sectors = 128B), potentially resulting in over-fetching from L2.
On an L1 un-cached load, a miss is not promoted, so only the missed sectors will be fetched from the L2.
The texture cache will only fetch missed sectors (like un-cached L1).
A Kepler SM has 2 (gk208, gk20a) or 4 (gk10x, gk110, *) texture caches. There is a fixed relationship between an SM sub-partition (warp scheduler) and a texture cache. Using the LDG instruction to read the same data from all warps (multiple SM sub-partitions) may result in the same data being resident in all texture caches. The texture caches in the SM are not coherent, which is why access can only be read-only. To gain the benefit of the full cache footprint of the N texture caches, it is useful to access different addresses per warp.
On a load accessing divergent addresses (different cache lines), the SM warp scheduler has to replay the instruction. This is also true on misses. In contrast, address divergence is handled inside the texture cache, which avoids the loss in math throughput due to instruction replays.
The texture and L1 data cache share the request path to L2 and the return path from L2.
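
To put numbers on the over-fetch point in the list above, here is a small host-side sketch. It is a simplified model, not a description of the real hardware: it assumes a cold cache (every touched sector misses), one warp of 32 threads, and a fixed byte stride between lanes. The names `estimate_warp_fetch` and `FetchEstimate` are made up for this example. It counts the bytes requested from L2 when a sector miss is promoted to a full 128B line versus when only the missed 32B sectors are fetched.

```cuda
#include <cstdint>
#include <set>

constexpr std::uint64_t kSectorBytes = 32;   // sector size from the post
constexpr std::uint64_t kLineBytes   = 128;  // 4 x 32B sectors
constexpr int           kWarpSize    = 32;

struct FetchEstimate {
    std::uint64_t line_promoted_bytes;  // model of an L1 cached load (miss -> whole line)
    std::uint64_t sector_only_bytes;    // model of un-cached L1 / texture (missed sectors only)
};

// Each lane reads `access_bytes` (>= 1) starting at base + lane * stride_bytes.
inline FetchEstimate estimate_warp_fetch(std::uint64_t base,
                                         std::uint64_t stride_bytes,
                                         std::uint64_t access_bytes)
{
    std::set<std::uint64_t> sectors;  // distinct 32B sectors touched by the warp
    std::set<std::uint64_t> lines;    // distinct 128B lines touched by the warp
    for (int lane = 0; lane < kWarpSize; ++lane) {
        std::uint64_t first = base + lane * stride_bytes;
        std::uint64_t last  = first + access_bytes - 1;
        for (std::uint64_t s = first / kSectorBytes; s <= last / kSectorBytes; ++s) {
            sectors.insert(s);
            lines.insert(s * kSectorBytes / kLineBytes);
        }
    }
    return { lines.size() * kLineBytes, sectors.size() * kSectorBytes };
}
```

In this model, 4-byte loads with a 4-byte stride touch one line and four sectors, so both paths fetch 128 bytes. With a 128-byte stride, each lane lands in its own line, so the line-promoted path fetches 32 x 128 = 4096 bytes while the sector-only path fetches 32 x 32 = 1024 bytes.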

Compute Capability 5.x (Maxwell) - 6.x (Pascal) unified the L1 and Texture cache but moved SHM to a separate unit. The LDG instruction was used to force a read through the unified L1/texture cache and to differentiate between a generic load and a global load. Generic loads have a slight penalty compared to global loads because the LSU has to determine whether the address is shared, local, or global. Additional serialization is required if threads in the same instruction access multiple address windows (shared, local, and global). Where possible, it is preferable to use LDS (load shared), LDL (load local), and LDG (load global).
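
The difference between generic and space-specific loads can be sketched from the CUDA C++ side as follows. This is only an illustration under assumptions: the names `load_generic`, `mixed_loads`, and `run_mixed_loads` are made up, and which SASS instruction the compiler actually emits (LD, LDG, LDS) depends on the compiler version and optimization decisions, so inspect the result with `cuobjdump -sass`. The idea is that a pointer whose address space is visible to the compiler (a kernel parameter, a `__shared__` array) can get a specific load, while a pointer passed into a non-inlined function that might point anywhere may need a generic load. Here odd lanes pass a shared address and even lanes pass a global address, which is the "multiple address windows in one instruction" case described above.

```cuda
#include <cuda_runtime.h>

// The compiler cannot see which address space `p` points into, so it may have
// to emit a generic load here. __noinline__ keeps that property visible.
__device__ __noinline__ float load_generic(const float* p)
{
    return *p;
}

// Launch with exactly 32 threads in one block.
__global__ void mixed_loads(const float* __restrict__ gdata, float* __restrict__ out)
{
    __shared__ float tile[32];
    int lane = threadIdx.x;

    tile[lane] = gdata[lane];   // address space known: global load, shared store
    __syncthreads();

    float a = tile[lane];       // address space known: shared load

    // Odd lanes pass a shared address, even lanes a global one, so a single
    // warp-wide load touches two address windows.
    float b = load_generic((lane % 2) ? &tile[lane] : &gdata[lane]);

    out[lane] = a + b;          // equals 2 * gdata[lane] for every lane
}

// Hypothetical host helper: input is 0..31, returns the sum of the outputs.
// Returns -1.0f if allocation fails.
float run_mixed_loads()
{
    float h_in[32];
    float h_out[32] = {0.0f};
    for (int i = 0; i < 32; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) { cudaFree(d_in); return -1.0f; }
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    mixed_loads<<<1, 32>>>(d_in, d_out);

    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    float sum = 0.0f;
    for (int i = 0; i < 32; ++i) sum += h_out[i];
    return sum;
}
```

The results are the same whichever load instructions are chosen; only the instruction mix in the disassembly differs.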

Compute Capability 7.x+ (Volta, Turing, Ampere, Ada, Hopper) unified the L1 Data Cache, Texture Cache, and Shared Memory into a single unit. The ISA matches Compute Capability 5.x above.

202476410arsmart · January 3, 2024, 1:41am

By the way, maybe this is a typo, and should be load global.

Thank you very much!!!

