Can uncoalesced reads using float4 reduce waste?

202476410arsmart · October 16, 2023, 4:37pm

I need to read global memory values with a stride, such as a[0], a[10], a[20], etc. Obviously, this will result in non-coalesced accesses, which are likely slow because they use L2 cache lines inefficiently. So, does that mean the time taken to read a[0], a[10], … would be the same as reading a[0], a[1], a[2], a[3], …, a[10], a[11], a[12], a[13], …? (The additional values will be useful later on, but I don’t want to waste too much time during the initial read.) I’m thinking that if reading the extra three values doesn’t increase the time, then I would definitely prefer to use float4. Is my understanding correct?

Robert_Crovella · October 16, 2023, 4:53pm

The following is generally my understanding/expectation on a “modern” GPU architecture, e.g. Pascal or newer.

Assuming a[0] refers to a 4-byte type, then if thread 0 reads a[0], and the a[0] location refers to a properly-16-byte-aligned location, it should not matter whether you read a[0] as a 4-byte read or a[0] as a 16-byte read.

Assuming the data requested is not already in the L1 or L2 cache, then the memory controller will retrieve a minimum of 32 bytes to satisfy the read of a[0] by thread 0.

Note that if a[0] is a 4-byte quantity, and a[0] is properly aligned for a 16-byte load, and the indexing depicted is base-10, then a[10] is not properly aligned for a 16-byte load.
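The alignment point above can be checked with simple arithmetic. The following is a minimal host-side sketch, assuming 4-byte elements and an array base that is itself 16-byte aligned; the names `kElemBytes`, `byte_offset`, and `aligned_for_16_byte_load` are hypothetical and chosen only for illustration.

```cpp
#include <cstddef>

// Assumption: each element is a 4-byte type (e.g. float) and the array
// base address is 16-byte aligned, as in the reply above.
constexpr std::size_t kElemBytes = 4;

// Byte offset of a[index] from the (aligned) base of the array.
constexpr std::size_t byte_offset(std::size_t index) {
    return index * kElemBytes;
}

// True if a 16-byte (float4-sized) load starting at a[index] would be
// naturally aligned.
constexpr bool aligned_for_16_byte_load(std::size_t index) {
    return byte_offset(index) % 16 == 0;
}

// With a stride of 10 elements, alignment alternates:
static_assert(aligned_for_16_byte_load(0),   "a[0]  is at byte 0");
static_assert(!aligned_for_16_byte_load(10), "a[10] is at byte 40, and 40 % 16 == 8");
static_assert(aligned_for_16_byte_load(20),  "a[20] is at byte 80, and 80 % 16 == 0");
```

Under these assumptions, only every other strided element (a[0], a[20], a[40], …) starts on a 16-byte boundary, so a float4 load cannot be issued directly at a[10] or a[30].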

The actual benefit of such a change, if any, is questionable.

It depends on cache temporal locality. The extra bytes read, if used later, may or may not persist in the cache. If they are not evicted then subsequent use of the data should benefit from the cache.
At least for the L2, anyway, why should any of this matter? Whether you request 4 bytes or 16 bytes, the memory controller will retrieve a minimum of a 32-byte L2 sector that encompasses the area. The memory traffic and L2 footprint should not change. Therefore, if the data is not evicted, you should still get L2 cache benefit from having retrieved the “extra” data, whether you request 4 bytes or 16 bytes.
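The sector argument can be illustrated by counting how many 32-byte sectors a single request overlaps. This is only a sketch of the arithmetic, not a model of the actual memory controller; `kSectorBytes` and `sectors_touched` are hypothetical names, and the 32-byte sector size is taken from the reply above.

```cpp
#include <cstddef>

// Assumption: the minimum unit fetched is a 32-byte sector, as stated above.
constexpr std::size_t kSectorBytes = 32;

// Number of 32-byte sectors overlapped by a request of `request_bytes`
// (must be > 0) starting at `offset` bytes from a sector-aligned base.
constexpr std::size_t sectors_touched(std::size_t offset, std::size_t request_bytes) {
    return (offset + request_bytes - 1) / kSectorBytes - offset / kSectorBytes + 1;
}

// At a 16-byte-aligned address, a 4-byte and a 16-byte request both fall
// inside one sector, so the fetched footprint is the same.
static_assert(sectors_touched(0, 4) == 1,   "4-byte read of a[0]");
static_assert(sectors_touched(0, 16) == 1,  "16-byte read starting at a[0]");
static_assert(sectors_touched(80, 4) == 1,  "4-byte read of a[20]");
static_assert(sectors_touched(80, 16) == 1, "16-byte read starting at a[20]");
```

Because 32 is a multiple of 16, a naturally aligned 16-byte request never straddles a sector boundary, which is why the wider request does not enlarge the footprint in this simplified picture.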

202476410arsmart · October 16, 2023, 4:55pm

Thank you!!! Actually, I will store all the data in shared memory for future use! This is the first stage of the pipeline, so I do not want to spend too much time on it (because there is no computation to hide the latency). So I think the answer is “yes”: reading 4 values will take almost the same time as reading 1 value, but gives me 3 more values for future use! (I guess this is what you mean.)

Thank you again for your fast response!!!
