On an A100, if a warp needs to load several 32-byte (one-sector) pieces of data from gmem, but they are not coalesced into 128B (a cache line), will it be four times slower than the coalesced case? If so, what is the meaning of a sector? If not, what is the advantage of coalescing to 128B?
No, it will be at full speed.
AFAIK it will take up 128 bytes in the caches.
Could be that it takes up more space in the memory command pipelines before L1, not sure.
You can fine-tune how much of the 128 bytes is prefetched into the caches.
In past architectures, the 128-byte granularity was more critical.
So loading/storing will not be directly slower if I load individual sectors rather than full cache lines. But what is the difference between them? I don't understand the prefetch mechanism well.
Is the disadvantage that if I load only one sector of a cache line, the whole cache line is stored in the cache, so the cache fills up sooner?
Just concentrate on sectors. The 128-byte granularity is purely fine-tuning.
How much it loads depends on the parameters for the load function.
https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-ld
You can choose no prefetch (32 bytes, i.e. one sector), or 64, 128, or 256 bytes. Prefetching reduces latency, but does not improve bandwidth. It even worsens bandwidth if you did not need the preloaded data.
If your cache gets full and cache capacity is critical for your algorithm, try to arrange the data to profit from the 128-byte granularity.
In a sectored cache, each cache line comprises multiple sectors: four 32-byte sectors per 128-byte cache line in the case of NVIDIA GPUs. There is one tag for the entire cache line, but each sector has its own status bits (valid, dirty, etc.). This means that sectors are individually replaceable. However, GPU architectures implement various automated prefetching schemes, where the load of one sector can trigger the load of an adjacent sector or sectors. As I recall, the details of this differ between architectures, and the policy may or may not be programmer-configurable.
Prefetching data that is never used can have a negative impact on performance by wasting finite bandwidth. Thus the sparse use of cachelines is not advised. However, before you rush to do something about it, let the profiler be your guide as to whether this is an actionable issue in the specific context of your application.