Must float4 read adjacent elements? Can it be modified for coalesced reading?

202476410arsmart · April 27, 2022, 4:11am

I am learning about vectorized memory access through the material below:

Liu-xiandong/How_to_optimize_in_GPU/blob/master/elementwise/elementwise_add.cu

#include <bits/stdc++.h>
#include <cuda.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <time.h>
#include <sys/time.h>

#define THREAD_PER_BLOCK 256

// transfer vector
#define FETCH_FLOAT2(pointer) (reinterpret_cast<float2*>(&(pointer))[0])
#define FETCH_FLOAT4(pointer) (reinterpret_cast<float4*>(&(pointer))[0])

__global__ void add(float* a, float* b, float* c)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    c[idx] = a[idx] + b[idx];
}
__global__ void vec2_add(float* a, float* b, float* c)
{

This file has been truncated.

https://developer.nvidia.com/blog/cuda-pro-tip-increase-performance-with-vectorized-memory-access/#entry-content-comments

I noticed one thing: when we ask for an int2 load at a[2] (where a is an int array), the hardware actually reads a[2] and a[3]. Does this rule hold for all vector types, such as int2, int3, int4, float2, float3, float4, char2, char3, and char4 (if that exists)?

Another, more important question: can we modify it so that the reads are coalesced? For example, when we ask for an int2 load at a[2], could the hardware read a[2] and a[2+32] instead? That would be very useful. Is it possible?

Thank you!!!

njuffa · April 27, 2022, 7:10am

This:

#define FETCH_FLOAT2(pointer) (reinterpret_cast<float2*>(&(pointer))[0])
#define FETCH_FLOAT4(pointer) (reinterpret_cast<float4*>(&(pointer))[0])

can fail easily. GPUs require natural alignment for memory accesses. That means that an N-byte item must be accessed at an address that is an integer multiple of N bytes. Therefore float2 requires 8-byte alignment and float4 requires 16-byte alignment. Simply casting a float * with 4-byte alignment to a float2 * or float4 * can easily lead to misaligned access unless the programmer makes sure that the required alignment is guaranteed.
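As a rough sketch of how the alignment requirement could be enforced, the host code below checks the pointers before choosing between a float4 kernel and a scalar fallback. The names is_aligned16, vec4_add, and launch_add are hypothetical and not taken from the linked repository, and requiring n to be a multiple of 4 is a simplification (a scalar tail loop would be another option).

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical helper: true if p lies on a 16-byte boundary, as float4 requires.
static bool is_aligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Vector path: each thread adds one float4, i.e. four consecutive floats.
__global__ void vec4_add(const float4* a, const float4* b, float4* c, int n4)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n4) {
        float4 x = a[idx];
        float4 y = b[idx];
        c[idx] = make_float4(x.x + y.x, x.y + y.y, x.z + y.z, x.w + y.w);
    }
}

// Scalar fallback: each thread adds one float.
__global__ void add(const float* a, const float* b, float* c, int n)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

// Hypothetical launcher. Returns true if the float4 path was used.
// The float4 path is taken only when all three pointers are 16-byte
// aligned and n is a multiple of 4 (simplifying assumption).
bool launch_add(const float* a, const float* b, float* c, int n)
{
    const int threads = 256;
    if (n <= 0) {
        return false;  // nothing to do; avoid launching an empty grid
    }
    bool vec_ok = is_aligned16(a) && is_aligned16(b) && is_aligned16(c)
                  && (n % 4 == 0);
    if (vec_ok) {
        int n4 = n / 4;
        vec4_add<<<(n4 + threads - 1) / threads, threads>>>(
            reinterpret_cast<const float4*>(a),
            reinterpret_cast<const float4*>(b),
            reinterpret_cast<float4*>(c), n4);
    } else {
        add<<<(n + threads - 1) / threads, threads>>>(a, b, c, n);
    }
    return vec_ok;
}
```

Pointers returned by cudaMalloc are aligned to at least 256 bytes, so the vector path is taken for whole allocations. A pointer offset by one float, such as a + 1, fails the check and falls back to the scalar kernel.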

The GPU hardware supports 32-bit, 64-bit, and 128-bit loads and stores. Using vector types (so up to int4, float4, double2) is an easy way to utilize these loads and stores. The data accessed by each instruction is a contiguous group of 4/8/16 bytes. There is no gather/scatter functionality. Coalesced memory access in CUDA is typically achieved by mapping data to threads appropriately, notably by using the “base + thread-index” idiom for addressing global memory. Where that is not easily possible, buffering in shared memory may help.
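To illustrate the "base + thread-index" idiom for the a[2] and a[2+32] pattern from the question, here is a minimal sketch in which each thread processes two elements that are blockDim.x apart. It uses two separate scalar loads per array rather than one vector load, and each of them is coalesced because consecutive threads touch consecutive addresses. The kernel name add_two_strided and its launcher are hypothetical.

```cuda
#include <cuda_runtime.h>

// Each block covers 2 * blockDim.x consecutive elements.
// Thread t of a block handles element (base + t) and element
// (base + t + blockDim.x). With blockDim.x == 32 this is the
// "a[t] and a[t+32]" pattern, done as two scalar loads per array.
__global__ void add_two_strided(const float* a, const float* b, float* c, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x * 2;  // first element
    int j = i + blockDim.x;                             // second element
    if (i < n) {
        c[i] = a[i] + b[i];
    }
    if (j < n) {
        c[j] = a[j] + b[j];
    }
}

// Hypothetical launcher: one block per 2 * threads elements.
void launch_add_two_strided(const float* a, const float* b, float* c, int n)
{
    const int threads = 256;
    if (n <= 0) {
        return;  // avoid launching an empty grid
    }
    int blocks = (n + 2 * threads - 1) / (2 * threads);
    add_two_strided<<<blocks, threads>>>(a, b, c, n);
}
```

Because only scalar 4-byte accesses are involved, this version needs no more than the natural 4-byte alignment of float.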

202476410arsmart · April 27, 2022, 7:38am

Well, this does not quite answer my question, but thank you! I am also interested in this: how can it be used safely?

njuffa · April 27, 2022, 7:42am

Reviewing my post, I seem to have addressed all the questions in your initial post. Which question(s) do you consider unanswered?

202476410arsmart · April 27, 2022, 7:45am

Oh, I think you mean that there is no way to access a[2] and a[2+32] with a single int2 load; int2 (and the other vector types) can only access contiguous memory.
Thank you!

njuffa · April 27, 2022, 7:51am

There is no way to access a[2] and a[2+32] in a single load. That would be a particular form of a gather operation. As I stated, each access must be to a contiguous group of bytes, and a[2] and a[2+32] are not contiguous.

Note that the hardware accesses 2ⁿ (n=2,3,4) bytes at a time, so a vector type like float3 (12 bytes) does not fit into that scheme and will probably require two accesses at the SASS (machine code) level. It may be instructive to see what is happening under the hood by examining the generated machine code with cuobjdump --dump-sass. Load instructions start with LD and store instructions with ST.
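A small test file along the following lines could be used for such an inspection. The kernel names and the file name copy.cu are assumptions, and the exact instructions in the dump depend on the target architecture and compiler version.

```cuda
// Build and inspect (the file name copy.cu is an assumption):
//   nvcc -cubin -o copy.cubin copy.cu
//   cuobjdump --dump-sass copy.cubin
#include <cuda_runtime.h>

// float3 is 12 bytes with 4-byte alignment, so it does not match the
// 4/8/16-byte access sizes. float4 is 16 bytes with 16-byte alignment.
static_assert(sizeof(float3) == 12 && alignof(float3) == 4, "float3 layout");
static_assert(sizeof(float4) == 16 && alignof(float4) == 16, "float4 layout");

// Copies one float3 per thread. Compare the LD/ST instructions the
// compiler emits for this kernel with those for copy_float4.
__global__ void copy_float3(const float3* in, float3* out, int n)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) {
        out[idx] = in[idx];
    }
}

// Copies one float4 per thread.
__global__ void copy_float4(const float4* in, float4* out, int n)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) {
        out[idx] = in[idx];
    }
}
```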

202476410arsmart · April 27, 2022, 7:54am

Thank you!!!
