Coalesced Memory Read Question

Hi Everybody,

I’m new to this forum, so hello to everybody :)

I have a (maybe stupid) question. In my kernel I would like to work with some point data (float3 with x, y, z components).

My question is about reading data from global memory into shared memory. From the CUDA C Best Practices Guide I know that the GPU services a warp’s memory request with 128-byte transactions. That means each warp (32 threads) can easily read any 4-byte data type in a coalesced way with a single transaction.
There are many examples of this on the internet and I get it. It’s easy :)

But what happens if each thread wants to read a float3 or float4?
Will the SM issue additional 128-byte transactions until all of the data has been copied?
Example:
float3 (3 × 4 bytes = 12 bytes)
32 threads × 12 bytes = 384 bytes
One transaction covers 128 bytes, so the SM will run 3 transactions per warp?

Or, instead of storing the data in AoS style, is it better to store my point data in SoA? In code, I mean something like the sketch below.
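(A sketch of both layouts; the kernel and array names are made up.)

```
// AoS: each thread reads one packed 12-byte float3, so a warp touches
// 384 contiguous bytes and the hardware splits the request into three
// 128-byte transactions.
__global__ void readAoS(const float3* points, float3* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = points[i];   // compiles to three 4-byte loads per thread
}

// SoA: x/y/z live in separate arrays; each of the three reads below is
// a fully coalesced 128-byte transaction per warp.
__global__ void readSoA(const float* x, const float* y, const float* z,
                        float3* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = make_float3(x[i], y[i], z[i]);
}
```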

If my question is stupid, sorry in advance :)

I’m no expert, but I think you’re right. If a warp reads 32 float3’s, it takes three 128-byte transactions, no matter how you arrange the data. Your warps will spend longer on loads, but you can’t really help that, now can you?

The SoA approach would still take three transactions as well. No matter how you slice it, you still have to read all the bytes.

You could try out the ‘trove’ header-only library, which provides fast access to AoS data → GitHub - bryancatanzaro/trove: Full-speed Array of Structures access

Yes, multiple transactions will be issued.

SoA as a general recommendation is a good idea, but there shouldn’t be any problem (no difference in efficiency) with loading a float4 per thread. The GPU can read up to 16 bytes per thread with a single load instruction.
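For the float4 case, a minimal sketch (kernel name is made up):

```
// Each thread loads 16 bytes with a single 128-bit load instruction;
// the warp's 512 contiguous bytes are served in four 128-byte
// transactions with nothing wasted.
__global__ void readFloat4(const float4* __restrict__ in,
                           float4* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}
```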

float3 may be a little more troublesome. This has to do with alignment: since packed float3 elements can’t all be aligned on a power-of-2 boundary (the GPU can only load 1, 2, 4, 8, or 16 bytes per thread in one instruction), there are various approaches to address this.

Programming Guide :: CUDA Toolkit Documentation
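For example, one well-known workaround is to stage the packed float3 data through shared memory using plain float loads. A sketch, assuming the kernel is launched with 256 threads per block and the input viewed as a flat float array:

```
#define BLOCK 256

// Stage packed float3 data through shared memory: the three global
// loads per thread are plain 4-byte, fully coalesced accesses, and the
// awkwardly aligned float3 view only ever touches shared memory.
__global__ void readFloat3Staged(const float* __restrict__ in,  // n*3 floats
                                 float3* __restrict__ out, int n)
{
    __shared__ float s[BLOCK * 3];
    int base = blockIdx.x * BLOCK * 3;

    for (int k = 0; k < 3; ++k) {
        int idx = k * BLOCK + threadIdx.x;
        if (base + idx < 3 * n)
            s[idx] = in[base + idx];
    }
    __syncthreads();

    int i = blockIdx.x * BLOCK + threadIdx.x;
    if (i < n)  // stride-3 shared reads are bank-conflict-free (gcd(3,32)=1)
        out[i] = make_float3(s[3 * threadIdx.x],
                             s[3 * threadIdx.x + 1],
                             s[3 * threadIdx.x + 2]);
}
```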

Global reads of the float3 or int3 type come up often, and in such cases the best approach is either to use the float4 type and ignore the .w value (or put something else there you may need), or to find a way to compress the data down so it fits into a float2 or int2 type.
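For example (a sketch; the kernel and the idea of stashing a payload in .w are illustrative):

```
// Pad each point out to a 16-byte float4; .w can carry a spare payload
// (radius, mass, an id, ...) or simply be ignored.
__global__ void scalePoints(float4* pts, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 p = pts[i];              // one aligned 16-byte load
        p.x *= s; p.y *= s; p.z *= s;   // .w is left untouched
        pts[i] = p;                     // one aligned 16-byte store
    }
}
```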

I sketched out an overly tricky float3 load strategy several years ago:

The note at the bottom points out that if you can control how your float3 structs are stored, then you can avoid all of this hassle and split the float3 into a simple float2 + float load (i.e. 256-byte + 128-byte transactions per warp).
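In kernel form the split might look like this (a sketch; array names are made up):

```
// x,y in a float2 array and z in its own float array: the warp issues
// one 256-byte-per-warp float2 read plus one 128-byte-per-warp float
// read, both naturally aligned and fully coalesced.
__global__ void readSplit(const float2* __restrict__ xy,
                          const float*  __restrict__ z,
                          float3* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float2 p = xy[i];
        out[i] = make_float3(p.x, p.y, z[i]);
    }
}
```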

In addition to the possibilities enumerated by CudaaduC, it may also be possible to process groups of 3-vectors, such that four float3 values are temporarily re-ordered into three float4 values for storage in GPU memory and unpacked after they have been loaded. This may even come fairly naturally, for example when using quads in graphics.
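A sketch of the unpack step (the helper name is made up), assuming four consecutive float3 values have been stored as three float4 values:

```
// Four packed float3 values (12 floats) occupy exactly three float4
// slots, so every global load stays 16-byte aligned; the shuffle back
// into float3 happens in registers.
__device__ void loadFourPoints(const float4* __restrict__ packed,
                               int group,      // index of a 4-point group
                               float3 p[4])
{
    float4 a = packed[3 * group + 0];
    float4 b = packed[3 * group + 1];
    float4 c = packed[3 * group + 2];
    p[0] = make_float3(a.x, a.y, a.z);
    p[1] = make_float3(a.w, b.x, b.y);
    p[2] = make_float3(b.z, b.w, c.x);
    p[3] = make_float3(c.y, c.z, c.w);
}
```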

If the LDG.CI (cache-incoherent / texture cache) instruction is available on your hardware (Maxwell, or Kepler sm_35), then you can probably just load your data in a rather naive way. The first load will populate the texture cache, and subsequent loads will pull from there at low latency / high throughput.
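On sm_35+ you can request that path explicitly with __ldg() (or just mark the pointers const __restrict__ and let the compiler emit LDG.CI itself). A sketch:

```
// Route the loads through the read-only (texture) cache path. There is
// no float3 overload of __ldg(), so load the components individually.
__global__ void readViaReadOnlyCache(const float3* __restrict__ in,
                                     float3* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float3 p;
        p.x = __ldg(&in[i].x);
        p.y = __ldg(&in[i].y);
        p.z = __ldg(&in[i].z);
        out[i] = p;
    }
}
```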

The CI L1 cache is really useful if you know how to leverage it. The key is to make sure you’re not overflowing it with 32-byte transactions: it only holds 768 of them (768 × 32 B = 24 KB). So you might want to limit occupancy to avoid dropping L1 data you still intend to fetch.
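One blunt way to do that (a sketch, not a recommendation; names are made up): reserve dynamic shared memory at launch so fewer blocks, and therefore fewer warps, are resident per SM.

```
__global__ void ldgKernel(const float* __restrict__ in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __ldg(&in[i]);
}

void launchCapped(const float* d_in, float* d_out, int n)
{
    int threads = 128;
    int blocks  = (n + threads - 1) / threads;
    // Claiming 24 KB of dynamic shared memory per block means at most
    // two blocks fit in a 48 KB shared-memory configuration, so fewer
    // resident warps compete for the 24 KB read-only cache.
    ldgKernel<<<blocks, threads, 24 * 1024>>>(d_in, d_out, n);
}
```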