I have a question regarding irregular memory access and caching. The memory coalescer of the load/store unit can only work efficiently if the memory access pattern is regular, so that threads access contiguous memory. How does the GPU cache data from irregular memory accesses? Can the GPU still use the cache efficiently in that case? I am not very familiar with this subject, so I may have misunderstood some concepts. Any explanations are appreciated.
The GPU will request lines (from the cache) or segments (from memory) as needed, to satisfy the addresses requested across the warp. Cache lines are either 128 or 32 bytes, and memory segments are 32 bytes, for all CUDA GPU architectures I am familiar with.
Therefore, you can figure out what will be present or populated in the cache by determining which 32-byte memory segments will be retrieved, to satisfy a particular load or store request.
There is no reason to assume anything else gets cached (e.g., via speculative prefetching) as a result of the transactions themselves.
Suppose memory segments are arranged starting from address 0, in 32 byte groups.
Suppose across a warp, we request an int value from int index 2 and an int value from int index 1024.
After those transactions are serviced, I would expect the bytes from 0…31 and the bytes from 4096…4127 to be resident in the L2 cache.