Must float4 read adjacent elements? Can it be modified for coalesced reading?

I am learning about vectorized memory access from the material below:

https://developer.nvidia.com/blog/cuda-pro-tip-increase-performance-with-vectorized-memory-access/#entry-content-comments

I found out one thing: when we ask the system to read an int2 at a[2], it actually reads a[2] and a[3]! Does this rule also hold for all vector types, such as int2, int3, int4, float2, float3, float4, char2, char3, char4 (does that exist?)?

Another question, and the more important one: can we change this so that the read is coalesced? For example, when we ask the system to read an int2 at a[2], could it read a[2] and a[2+32] instead? This would be very useful! Can we?

Thank you!!!

This:

#define FETCH_FLOAT2(pointer) (reinterpret_cast<float2*>(&(pointer))[0])
#define FETCH_FLOAT4(pointer) (reinterpret_cast<float4*>(&(pointer))[0])

can fail easily. GPUs require natural alignment for memory accesses. That means that an N-byte item must be accessed at an address that is an integer multiple of N bytes. Therefore float2 requires 8-byte alignment and float4 requires 16-byte alignment. Simply casting a float * with 4-byte alignment to a float2 * or float4 * can easily lead to misaligned access, unless the programmer makes sure that the required alignment is guaranteed.
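To make the alignment rule concrete, here is a minimal host-side sketch of a check one could run before doing such a cast. The helper names is_aligned and scalar_prefix are made up for illustration and are not part of CUDA; in device code the same arithmetic should work if the functions are marked __host__ __device__.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when `p` sits on an integer multiple of `bytes`.
// Use bytes = 8 before casting to float2*, bytes = 16 before casting to float4*.
inline bool is_aligned(const void* p, std::size_t bytes) {
    return reinterpret_cast<std::uintptr_t>(p) % bytes == 0;
}

// Hypothetical helper: how many leading floats must be handled one at a time
// before `p` reaches a 16-byte boundary (0 to 3 for a 4-byte-aligned pointer).
inline std::size_t scalar_prefix(const float* p) {
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    return ((16 - addr % 16) % 16) / sizeof(float);
}
```

One possible use: take the vectorized path only when is_aligned(ptr, 16) holds, and otherwise process the first scalar_prefix(ptr) elements as plain floats before switching to float4.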

The GPU hardware supports 32-bit, 64-bit, and 128-bit loads and stores. Using vector types (so up to int4, float4, double2) is an easy way to utilize these loads and stores. The data accessed by each instruction is a contiguous group of 4/8/16 bytes. There is no gather/scatter functionality. Coalesced memory access in CUDA is typically achieved by mapping data to threads appropriately, notably by using the “base + thread-index” idiom for addressing global memory. Where that is not easily possible, buffering in shared memory may help.
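The “base + thread-index” idiom can be illustrated with a small host-side model of the addresses a warp would generate. This is only a sketch: first_byte, warp_is_contiguous, and the stride parameter are names invented here, and a warp size of 32 is assumed. It models byte offsets only and does not run on the GPU.

```cpp
#include <cassert>
#include <cstddef>

const std::size_t kWarpSize = 32;  // assumed warp size

// Byte offset of the first byte read by thread `tid` when each thread loads
// one element of `elem_bytes` bytes at element index base + tid * stride.
inline std::size_t first_byte(std::size_t base, std::size_t tid,
                              std::size_t stride, std::size_t elem_bytes) {
    return (base + tid * stride) * elem_bytes;
}

// True when the 32 threads of a warp together touch one gap-free byte range,
// i.e. each thread's element ends exactly where the next thread's begins.
inline bool warp_is_contiguous(std::size_t base, std::size_t stride,
                               std::size_t elem_bytes) {
    for (std::size_t tid = 0; tid + 1 < kWarpSize; ++tid) {
        std::size_t end_of_this =
            first_byte(base, tid, stride, elem_bytes) + elem_bytes;
        if (end_of_this != first_byte(base, tid + 1, stride, elem_bytes)) {
            return false;
        }
    }
    return true;
}
```

With stride 1 and 16-byte elements (one float4 per thread), the warp covers one contiguous 512-byte range, so the vector loads are already coalesced across threads. With stride 2 the model reports gaps between neighboring threads.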

Well, this is not quite the answer to my question, but thank you! I am also interested in this: how can these casts be used safely?

Reviewing my post, I seem to have addressed all the questions in your initial post. Which question(s) do you consider unanswered?

Oh, I think you mean that there is no way to access a[2] and a[2+32] with a single int2 load; int2 (and the other vector types) can only access contiguous memory.
Thank you!!!

There is no way to access a[2] and a[2+32] in a single load. That would be a particular form of a gather operation. As I stated, each access must be to a contiguous group of bytes, and a[2] and a[2+32] are not contiguous.

Note that the hardware accesses 2^n (n = 2, 3, 4) bytes at a time, and a vector type like float3 (12 bytes) does not fit into that scheme, so it will probably require two accesses at the SASS (machine code) level. It may be instructive to see what is happening under the hood by examining the generated machine code with cuobjdump --dump-sass. Load instructions start with LD and store instructions with ST.
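As a rough illustration of why 12 bytes does not fit, the sketch below computes a lower bound on the number of accesses for a given size, assuming only the 4-, 8-, and 16-byte widths discussed in this thread. The function min_accesses is made up for this example. The real instruction count depends on alignment and on what the compiler emits, so the SASS output remains the authority.

```cpp
#include <cassert>
#include <cstddef>

// Lower bound on the number of load instructions needed for `bytes` bytes,
// assuming only 16-, 8-, and 4-byte accesses are available and alignment
// permits the widest choice at every step. `bytes` must be a multiple of 4.
inline std::size_t min_accesses(std::size_t bytes) {
    std::size_t count = 0;
    const std::size_t widths[] = {16, 8, 4};
    for (std::size_t w : widths) {
        count += bytes / w;  // take as many accesses of this width as fit
        bytes %= w;          // the remainder goes to the narrower widths
    }
    return count;
}
```

Under these assumptions a 16-byte float4 needs at least one access, while a 12-byte float3 needs at least two (8 bytes plus 4 bytes).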

Thank you!!!
