I’ve been trying to find an answer to this question online, but haven’t found anything definitive. I have 32 threads in a warp accessing memory locations that are very far apart. From what I’ve read, the best approach would be to turn on uncached memory accesses, since I don’t want to pull in a full 128-byte cache line per access. My question is:
Should each of the 32 threads read 32 bytes, or just a single 4-byte float? I’m asking because I could pre-fetch a whole 32-byte line in each thread and stage it in shared memory, if that’s just as fast as reading 4 bytes per thread. I also plan to have many threads in flight, so I don’t expect a cached value to survive very long, which is why I want to know the largest size I can fetch at once.
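To make the question concrete, here’s a rough sketch of the two patterns I’m choosing between (kernel names, block size, and the `indices` array are just placeholders, not real code I’m running):

```cuda
// Pattern A: each thread reads a single scattered 4-byte float.
// I'd compile with -Xptxas -dlcm=cg so loads skip L1 and fetch 32-byte L2 segments.
__global__ void gatherOneFloat(const float *src, const int *indices, float *dst)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    dst[tid] = src[indices[tid]];  // one 4-byte load per thread
}

// Pattern B: each thread pre-fetches 32 bytes (as two float4 loads, since a
// single load is at most 16 bytes per thread) and stages them in shared memory.
__global__ void gatherEightFloats(const float4 *src, const int *indices, float *dst)
{
    __shared__ float4 staged[2 * 256];          // two float4 per thread, 256 threads/block
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int base = indices[tid];                    // index into float4-aligned data
    staged[2 * threadIdx.x]     = src[base];
    staged[2 * threadIdx.x + 1] = src[base + 1];
    __syncthreads();
    // ... consume the staged values from shared memory ...
    dst[tid] = staged[2 * threadIdx.x].x;
}
```

So the question is whether Pattern B’s wider per-thread fetch buys anything over Pattern A when the addresses are this scattered.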
What’s the speed difference between L2 and L1 cache? Would turning on cached accesses and reordering my data be faster than going uncached?