Can uncoalesced reads using float4 reduce waste?

I need to read global memory values with a stride, e.g. a[0], a[10], a[20], etc. Obviously, this will result in non-coalesced accesses, likely wasting most of each L2 cache line. So, does that mean the time taken to read a[0], a[10], … would be about the same as reading a[0] a[1] a[2] a[3], … a[10] a[11] a[12] a[13], …? (The additional values will be useful later on, but I don’t want to waste too much time during the initial read.) I’m thinking that if reading the extra three values doesn’t increase the time, then I would definitely prefer to use float4. Is my understanding correct?
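For concreteness, the two access patterns I’m comparing look roughly like this (illustrative kernels; the names, the `float*` type, and the stride of 10 are just examples):

```cuda
// Pattern 1: each thread reads one strided float: a[0], a[10], a[20], ...
__global__ void read_strided(const float *a, float *out, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = a[i * stride];   // uncoalesced for stride > 1
}

// Pattern 2: each thread reads a 16-byte float4 at the same starting point,
// picking up 3 extra neighboring values: a[0..3], a[10..13], ...
__global__ void read_strided_vec(const float *a, float4 *out, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Note: this requires &a[i * stride] to be 16-byte aligned.
    out[i] = *reinterpret_cast<const float4 *>(&a[i * stride]);
}
```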

The following is generally my understanding/expectation on a “modern” GPU architecture, e.g. Pascal or newer.

Assuming a[0] refers to a 4-byte type, then if thread 0 reads a[0], and the a[0] location is properly 16-byte aligned, it should not matter whether you read a[0] with a 4-byte load or a 16-byte load.

Assuming the data requested is not already in the L1 or L2 cache, then the memory controller will retrieve a minimum of 32 bytes to satisfy the read of a[0] by thread 0.

Note that if a[0] is a 4-byte quantity, a[0] is properly aligned for a 16-byte load, and the indexing depicted is base-10, then a[10] is not properly aligned for a 16-byte load: its byte offset, 40, is not a multiple of 16.

The actual benefit of such a change, if any, is questionable.

  1. It depends on cache temporal locality. The extra bytes read, if used later, may or may not persist in the cache. If they are not evicted then subsequent use of the data should benefit from the cache.

  2. At least as far as the L2 is concerned, why should any of this matter? Whether you request 4 bytes or 16 bytes, the memory controller will retrieve at minimum the 32-byte L2 sector that encompasses that area. The memory traffic and L2 footprint should not change. Therefore, if the data is not evicted, you should still get the L2 cache benefit of having retrieved the “extra” data, whether you requested 4 bytes or 16 bytes.


Thank you!!! Actually, I will store all the data in shared memory for future use! This is the first stage of the pipeline, so I don’t want to spend too much time here (there is no computation to hide the latency). So I think the answer is “yes”: reading 4 values takes almost the same time as reading 1, but gives me 3 more values for future use! (I guess this is what you mean.)
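A minimal sketch of what I have in mind (illustrative names; it assumes each loaded address is 16-byte aligned and that `TILE` float4 values are staged per block):

```cuda
#define TILE 128  // float4 values staged per block (illustrative)

__global__ void stage_to_shared(const float4 *a_vec, int stride4 /* stride in float4 units */)
{
    __shared__ float4 tile[TILE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // One 16-byte load per thread: the three "extra" floats come along
    // at essentially no additional cost.
    tile[threadIdx.x] = a_vec[i * stride4];
    __syncthreads();

    // ... later pipeline stages consume tile[] from shared memory ...
}
```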

Thank you again for your fast response!!!