On an A100, if a warp needs to load several 32-byte (one-sector) pieces of data from gmem, but they are not coalesced into 128B (a cache line), will it be four times slower than the coalesced case? If so, what is the meaning of a sector? If not, what is the advantage of coalescing to 128B?
No, it will be at full speed.
AFAIK it will take up 128 bytes in the caches.
Could be that it takes up more space in the memory command pipelines before L1, not sure.
You can fine-tune how much of the 128 bytes is prefetched into the caches.
In past architectures, the 128-byte granularity was more critical.
So loading/storing will not be directly slower if I load individual sectors rather than full cache lines. But what is the difference between them? I don't understand the prefetch mechanism well.
Is the disadvantage that if I load only one sector of a cache line, the whole cache line is stored in the cache, so the cache fills up sooner?
Just concentrate on sectors. The 128-byte granularity is purely fine-tuning.
How much it loads depends on the parameters for the load function.
https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-ld
You can choose no prefetch (32 bytes, i.e. one sector), or 64, 128, or 256 bytes. Prefetching reduces latency, but does not improve bandwidth. It even worsens bandwidth if you did not need the preloaded data.
If your cache gets full and cache capacity is critical for your algorithm, try to arrange the data to profit from the 128-byte granularity.
In a sectored cache, each cache line comprises multiple sectors: four 32-byte sectors per 128-byte cache line in the case of NVIDIA GPUs. There is one tag for the entire cache line, but each sector has its own status bits (valid, dirty, etc.). This means that sectors are individually replaceable. However, GPU architectures implement various automated prefetching schemes, where the load of one sector can trigger the load of an adjacent sector or sectors. As I recall, the details of this differ between architectures, and the policy may or may not be programmer-configurable.
Prefetching data that is never used can have a negative impact on performance by wasting finite bandwidth. Thus the sparse use of cachelines is not advised. However, before you rush to do something about it, let the profiler be your guide as to whether this is an actionable issue in the specific context of your application.