Coalesced Memory Read Question

Hi Everybody,

I’m new to this forum, so hello to everybody :)

I have a (maybe stupid) question. In my kernel I would like to work with some point data (float3 with x, y, z components).

My question is about reading data from global memory into shared memory. From the CUDA C Best Practices Guide I know that the GPU services a warp’s memory request with 128-byte transactions. That means each warp (32 threads) can easily read any 4-byte data type in a coalesced way with a single transaction.
There are many examples of this on the internet and I get it. It’s easy :)

But what happens if each thread wants to read a float3 or float4?
Will the SM issue additional 128-byte transactions until all of the data has been copied?
Example:
float3 (3 × 4 bytes = 12 bytes)
32 threads × 12 bytes = 384 bytes
One transaction covers 128 bytes, so the SM will run 3 transactions per warp?

Or, instead of storing the data in AoS style, is it better to store my point data in SoA? In code, I mean something like the sketch below.
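(A sketch of both layouts; the kernel and array names are made up.)

```
// AoS: each thread reads one packed 12-byte float3, so a warp touches
// 384 contiguous bytes and the hardware splits the request into three
// 128-byte transactions.
__global__ void readAoS(const float3* points, float3* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = points[i];   // compiles to three 4-byte loads per thread
}

// SoA: x/y/z live in separate arrays; each of the three reads below is
// a fully coalesced 128-byte transaction per warp.
__global__ void readSoA(const float* x, const float* y, const float* z,
                        float3* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = make_float3(x[i], y[i], z[i]);
}
```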

If my question is stupid, sorry in advance :)

I’m no expert, but I think you’re right. If a warp reads 32 float3’s, it takes three 128-byte transactions, no matter how you arrange the data. Your warps will spend longer on loads, but you can’t really help that, now can you?

The SoA approach would still take three transactions as well. No matter how you slice it, you still have to read all the bytes.

You could try out the ‘trove’ header-only library, which provides fast access to AoS data → GitHub - bryancatanzaro/trove: Full-speed Array of Structures access

Yes, multiple transactions will be issued.

SoA as a general recommendation is a good idea, but there shouldn’t be any problem (no difference in efficiency) with loading a float4 per thread. The GPU can read up to 16 bytes per thread with a single load instruction.
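For the float4 case, a minimal sketch (kernel name is made up):

```
// Each thread loads 16 bytes with a single 128-bit load instruction;
// the warp's 512 contiguous bytes are served in four 128-byte
// transactions with nothing wasted.
__global__ void readFloat4(const float4* __restrict__ in,
                           float4* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}
```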

float3 may be a little more troublesome. This has to do with alignment: since packed float3 elements can’t all be aligned on a power-of-2 boundary (the GPU can only load 1, 2, 4, 8, or 16 bytes per thread in one instruction), there are various approaches to address this.

Programming Guide :: CUDA Toolkit Documentation
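For example, one well-known workaround is to stage the packed float3 data through shared memory using plain float loads. A sketch, assuming the kernel is launched with 256 threads per block and the input viewed as a flat float array:

```
#define BLOCK 256

// Stage packed float3 data through shared memory: the three global
// loads per thread are plain 4-byte, fully coalesced accesses, and the
// awkwardly aligned float3 view only ever touches shared memory.
__global__ void readFloat3Staged(const float* __restrict__ in,  // n*3 floats
                                 float3* __restrict__ out, int n)
{
    __shared__ float s[BLOCK * 3];
    int base = blockIdx.x * BLOCK * 3;

    for (int k = 0; k < 3; ++k) {
        int idx = k * BLOCK + threadIdx.x;
        if (base + idx < 3 * n)
            s[idx] = in[base + idx];
    }
    __syncthreads();

    int i = blockIdx.x * BLOCK + threadIdx.x;
    if (i < n)  // stride-3 shared reads are bank-conflict-free (gcd(3,32)=1)
        out[i] = make_float3(s[3 * threadIdx.x],
                             s[3 * threadIdx.x + 1],
                             s[3 * threadIdx.x + 2]);
}
```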

Global reads of the float3 or int3 type come up often, and in such cases the best approach is either to use the float4 type and ignore the .w value (or put something else there you may need), or to find a way to compress the data down so it fits into a float2 or int2 type.
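For example (a sketch; the kernel and the idea of stashing a payload in .w are illustrative):

```
// Pad each point out to a 16-byte float4; .w can carry a spare payload
// (radius, mass, an id, ...) or simply be ignored.
__global__ void scalePoints(float4* pts, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 p = pts[i];              // one aligned 16-byte load
        p.x *= s; p.y *= s; p.z *= s;   // .w is left untouched
        pts[i] = p;                     // one aligned 16-byte store
    }
}
```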

I sketched out an overly tricky float3 load strategy several years ago:

The note at the bottom points out that if you can control how your float3 structs are stored, then you can avoid all of this hassle and split the float3 into a simple float2 + float load (i.e. 256-byte + 128-byte transactions per warp).
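In kernel form the split might look like this (a sketch; array names are made up):

```
// x,y in a float2 array and z in its own float array: the warp issues
// one 256-byte-per-warp float2 read plus one 128-byte-per-warp float
// read, both naturally aligned and fully coalesced.
__global__ void readSplit(const float2* __restrict__ xy,
                          const float*  __restrict__ z,
                          float3* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float2 p = xy[i];
        out[i] = make_float3(p.x, p.y, z[i]);
    }
}
```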

In addition to the possibilities enumerated by CudaaduC, it may also be possible to process groups of 3-vectors, such that four float3 values are temporarily re-ordered into three float4 values for storage in GPU memory and unpacked after they have been loaded. This may even come fairly naturally, for example when using quads in graphics.
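A sketch of the unpack step (the helper name is made up), assuming four consecutive float3 values have been stored as three float4 values:

```
// Four packed float3 values (12 floats) occupy exactly three float4
// slots, so every global load stays 16-byte aligned; the shuffle back
// into float3 happens in registers.
__device__ void loadFourPoints(const float4* __restrict__ packed,
                               int group,      // index of a 4-point group
                               float3 p[4])
{
    float4 a = packed[3 * group + 0];
    float4 b = packed[3 * group + 1];
    float4 c = packed[3 * group + 2];
    p[0] = make_float3(a.x, a.y, a.z);
    p[1] = make_float3(a.w, b.x, b.y);
    p[2] = make_float3(b.z, b.w, c.x);
    p[3] = make_float3(c.y, c.z, c.w);
}
```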

If the LDG.CI (cache-incoherent / texture cache) instruction is available on your hardware (Maxwell, or Kepler sm_35), then you can probably just load your data in a rather naive way. The first load will populate the texture cache, and subsequent loads will pull from there at low latency / high throughput.
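On sm_35+ you can request that path explicitly with __ldg() (or just mark the pointers const __restrict__ and let the compiler emit LDG.CI itself). A sketch:

```
// Route the loads through the read-only (texture) cache path. There is
// no float3 overload of __ldg(), so load the components individually.
__global__ void readViaReadOnlyCache(const float3* __restrict__ in,
                                     float3* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float3 p;
        p.x = __ldg(&in[i].x);
        p.y = __ldg(&in[i].y);
        p.z = __ldg(&in[i].z);
        out[i] = p;
    }
}
```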

The CI L1 cache is really useful if you know how to leverage it. The key is to make sure you’re not overflowing it with 32-byte transactions: it only holds 768 of them (768 × 32 B = 24 KB). So you might want to limit occupancy to avoid dropping L1 data you still intend to fetch.
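One blunt way to do that (a sketch, not a recommendation; names are made up): reserve dynamic shared memory at launch so fewer blocks, and therefore fewer warps, are resident per SM.

```
__global__ void ldgKernel(const float* __restrict__ in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __ldg(&in[i]);
}

void launchCapped(const float* d_in, float* d_out, int n)
{
    int threads = 128;
    int blocks  = (n + threads - 1) / threads;
    // Claiming 24 KB of dynamic shared memory per block means at most
    // two blocks fit in a 48 KB shared-memory configuration, so fewer
    // resident warps compete for the 24 KB read-only cache.
    ldgKernel<<<blocks, threads, 24 * 1024>>>(d_in, d_out, n);
}
```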