I am learning about vectorized memory reads through the material below:
And I found out one thing: when we ask the system to read an int2 a, it actually reads both components of a, i.e. two adjacent ints, at once! Does this rule also hold for all vector types, such as int2, int3, int4, float2, float3, float4, char2, char3, char4 (does that exist?)?
Another, more important question: can we modify it to read in a coalesced way? For example, when we ask the system to read int2 a, could the system actually read in a and a[2+32]? This would be very useful! Can we?
can fail easily. GPUs require natural alignment for memory accesses. That means that an N-byte item must be accessed at an address that is an integer multiple of N bytes. Therefore float2 requires 8-byte alignment and float4 requires 16-byte alignment. Simply casting a float * with 4-byte alignment to a float2 * or float4 * can easily lead to misaligned access, unless the programmer makes sure that the required alignment is guaranteed.
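As a rough illustration of the alignment point, here is a minimal sketch that checks the pointers before reinterpreting them as float4 * and falls back to scalar accesses otherwise. The names is_aligned_16, copy_floats and copy_roundtrip are made up for this example and are not part of any CUDA API; the offset trick relies on cudaMalloc returning generously aligned pointers, so adding one float deliberately breaks 16-byte alignment.

```cuda
#include <cassert>
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: true if p lies on a 16-byte boundary, as float4 needs.
__host__ __device__ inline bool is_aligned_16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Copies n floats. Takes the float4 path only if both pointers are 16-byte
// aligned; otherwise every element is copied with a scalar 4-byte access.
__global__ void copy_floats(const float* in, float* out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    if (is_aligned_16(in) && is_aligned_16(out)) {
        int n4 = n / 4;  // number of complete float4 items
        const float4* in4 = reinterpret_cast<const float4*>(in);
        float4* out4 = reinterpret_cast<float4*>(out);
        for (int i = tid; i < n4; i += stride) out4[i] = in4[i];
        // Leftover elements (n not a multiple of 4) are copied as scalars.
        for (int i = 4 * n4 + tid; i < n; i += stride) out[i] = in[i];
    } else {
        for (int i = tid; i < n; i += stride) out[i] = in[i];
    }
}

// Hypothetical driver: copies n floats that start `offset` floats into the
// allocations, and reports whether the copy came back intact.
bool copy_roundtrip(int offset, int n) {
    assert(offset >= 0 && n > 0);
    std::vector<float> h_in(n), h_out(n, -1.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);
    size_t bytes = n * sizeof(float);
    size_t alloc = (n + offset) * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, alloc) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, alloc) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in + offset, h_in.data(), bytes, cudaMemcpyHostToDevice);
    copy_floats<<<4, 128>>>(d_in + offset, d_out + offset, n);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out + offset, bytes,
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess && h_in == h_out;
}
```

Calling copy_roundtrip(0, 1003) should exercise the float4 path and copy_roundtrip(1, 1003) the scalar fallback. Without the check, the second call would hand a misaligned address to the float4 accesses.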
The GPU hardware supports 32-bit, 64-bit, and 128-bit loads and stores. Using vector types (so up to int4, float4, double2) is an easy way to utilize these loads and stores. The data accessed by each instruction is a contiguous group of 4/8/16 bytes. There is no gather/scatter functionality. Coalesced memory access in CUDA is typically achieved by mapping data to threads appropriately, notably by using the “base + thread-index” idiom of addressing global memory. Where that is not easily possible, buffering in shared memory may help.
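To make the “base + thread-index” idiom concrete, here is a small sketch using int2. Each thread reads one contiguous 8-byte item, and neighbouring threads read neighbouring items, so a warp of 32 threads covers one contiguous 256-byte span. The kernel name sum_pairs and the host function pair_sums are invented for illustration only.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Thread i reads the int2 at in[i]: one contiguous 8-byte item per thread,
// which the compiler can serve with a single 64-bit load.
__global__ void sum_pairs(const int2* in, int* out, int n_pairs) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // base + thread-index
    if (i < n_pairs) {
        int2 v = in[i];       // both components come from adjacent memory
        out[i] = v.x + v.y;
    }
}

// Hypothetical host driver: fills pair i with (2i, 2i+1) and returns the sums.
std::vector<int> pair_sums(int n_pairs) {
    assert(n_pairs > 0);
    std::vector<int2> h_in(n_pairs);
    for (int i = 0; i < n_pairs; ++i) h_in[i] = make_int2(2 * i, 2 * i + 1);
    std::vector<int> h_out(n_pairs, 0);

    int2* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, n_pairs * sizeof(int2));
    cudaMalloc(&d_out, n_pairs * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n_pairs * sizeof(int2), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n_pairs + threads - 1) / threads;
    sum_pairs<<<blocks, threads>>>(d_in, d_out, n_pairs);

    cudaMemcpy(h_out.data(), d_out, n_pairs * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

With this input layout, element i of the result should be 4i + 1. The array comes straight from cudaMalloc, so the 8-byte alignment that int2 needs is satisfied.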
There is no way to access a and a[2+32] in a single load. That would be a particular form of a gather operation. As I stated, each access must be to a contiguous group of bytes, and a and a[2+32] are not contiguous.
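If two elements that far apart are really needed per thread, the straightforward route is two separate scalar loads. A minimal sketch, assuming a stride of 32 ints as in the question (the names add_strided and strided_sums are hypothetical): each of the two loads is still coalesced across the warp, because thread i reads element i in the first load and element i + 32 in the second.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Assumed stride between the two elements each thread needs (from the question).
constexpr int kStride = 32;

// Two independent 32-bit loads per thread. Within each load, consecutive
// threads touch consecutive ints, so each load is coalesced on its own.
__global__ void add_strided(const int* in, int* out, int n_out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_out) {
        int first = in[i];             // load 1
        int second = in[i + kStride];  // load 2, a separate instruction
        out[i] = first + second;
    }
}

// Hypothetical host driver: input holds in[k] = k, so out[i] = 2*i + kStride.
std::vector<int> strided_sums(int n_out) {
    assert(n_out > 0);
    int n_in = n_out + kStride;  // room for the strided read of the last thread
    std::vector<int> h_in(n_in), h_out(n_out, 0);
    for (int k = 0; k < n_in; ++k) h_in[k] = k;

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n_in * sizeof(int));
    cudaMalloc(&d_out, n_out * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n_in * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n_out + threads - 1) / threads;
    add_strided<<<blocks, threads>>>(d_in, d_out, n_out);

    cudaMemcpy(h_out.data(), d_out, n_out * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```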
Note that the hardware accesses 2^n (n = 2, 3, 4) bytes at a time, so a vector type like float3 (12 bytes) does not fit into that scheme and will require more than one access at the SASS (machine code) level, typically three 32-bit ones, since float3 is only guaranteed 4-byte alignment. It may be instructive to see what is happening under the hood by examining the generated machine code with cuobjdump --dump-sass. Load instructions start with LD and store instructions with ST.
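For anyone who wants to try this, here is a small pair of kernels to compile and disassemble. This is only a sketch: the kernel names and the file name kernels.cu are invented, sm_XX is a placeholder for your GPU architecture, and the exact instructions you see will depend on the compiler version and target.

```cuda
#include <cassert>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// To inspect the machine code (file name and sm_XX are placeholders):
//   nvcc -arch=sm_XX -cubin -o kernels.cubin kernels.cu
//   cuobjdump --dump-sass kernels.cubin
// Then compare the LD*/ST* instructions of the two kernels below.

// 12-byte items: does not match a 4/8/16-byte access width.
__global__ void copy_float3(const float3* in, float3* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// 16-byte items: matches the 128-bit access width.
__global__ void copy_float4(const float4* in, float4* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Hypothetical driver: runs one of the kernels above and checks that the
// output is a byte-for-byte copy of the input.
template <typename T>
bool roundtrip(void (*kernel)(const T*, T*, int), const std::vector<T>& h_in) {
    int n = static_cast<int>(h_in.size());
    assert(n > 0);
    std::vector<T> h_out(n);
    size_t bytes = n * sizeof(T);
    T *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    kernel<<<(n + 127) / 128, 128>>>(d_in, d_out, n);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess && std::memcmp(h_in.data(), h_out.data(), bytes) == 0;
}
```

Both kernels produce the same result from the caller's point of view; the difference only shows up in the number and width of the load and store instructions in the SASS listing.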