Vectorize Read From Device Memory to Registers

Hi,
I am trying to make this part of the memory operation faster. Basically, it reads a block of a matrix into registers. The matrix size is 128x128 (N=128), and 128 threads work on it, each loading one row of the matrix into its own registers.

        const int tx  = threadIdx.x;
        const int gtx = blockIdx.x * blockDim.x + tx;

        double *A = dA + step + step * lda;
        double rA[N];

        #pragma unroll
        for (int i = 0; i < N; i++)
            rA[i] = A[gtx + i * lda];
  1. Is it now being read from memory sequentially?

  2. How can this operation be vectorized? Is it worthwhile?

If you do this operation across threads in a warp or threadblock, then each thread will end up with a “column” of the A matrix, deposited in the thread-local rA variable. The data is not read from memory “sequentially” but in blocks that are coalesced.

That doesn’t necessarily imply that all of rA will manifest in registers.

And regarding vectorization, I don’t understand the question. The load from A is already efficient.
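For concreteness, the access pattern of one warp in a single loop iteration looks like this (a sketch assuming lda = N = 128, as in your description):

// iteration i, threads t = 0..31 of one warp:
//   thread t reads A[t + i * lda]
//   byte offsets: 8*t + 8*i*lda  ->  32 consecutive doubles = one contiguous 256-byte segment
//   so each iteration issues a fully coalesced, warp-wide load,
//   while thread t collects element i of "its" column into rA[i]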

Thanks for your answer. About vectorization, I mean something like this: https://developer.nvidia.com/blog/cuda-pro-tip-increase-performance-with-vectorized-memory-access/

Reading a larger chunk of memory in each access, e.g. 2 doubles per load, and cutting the number of for-loop iterations in half.

Or the following code from this article : https://developer.nvidia.com/blog/cooperative-groups/

__device__ int thread_sum(int *input, int n) 
{
    int sum = 0;

    for(int i = blockIdx.x * blockDim.x + threadIdx.x;
        i < n / 4; 
        i += blockDim.x * gridDim.x)
    {
        int4 in = ((int4*)input)[i];
        sum += in.x + in.y + in.z + in.w;
    }
    return sum;
}

why not try it and see?

I expect the benefit to be small, if the load is being done at least warp-wide and the base type is at least as large as int (yours is double).

Note that doing a vector load would change what data actually ends up in each thread’s rA variable. So the two methods are not interchangeable, unless you also do data swizzling.

Actually I have no idea how to do it for the double data type.

Vector loads in CUDA top out at 16 bytes per load per thread. Since a double quantity is already 8 bytes, there is no vector-loadable double4 type. You would vector load a double2 type and get two quantities per thread, per load.

And as I said already, the two methods are not interchangeable without additional consideration.
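For reference, the thread_sum example from the blog, translated to double with double2 loads, might look roughly like this (a sketch assuming input is 16-byte aligned and n is even; it only illustrates the vector load itself, not the column-wise layout issue discussed below):

__device__ double thread_sum_double(const double *input, int n)
{
    double sum = 0.0;

    // grid-stride loop over double2 (16-byte) chunks
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n / 2;
         i += blockDim.x * gridDim.x)
    {
        double2 in = ((const double2 *)input)[i];
        sum += in.x + in.y;
    }
    return sum;
}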


What is the additional consideration? Do you mean it will change the floating-point representation?

Suppose the storage of A in global memory looks like this:

0.1 0.2 0.3 0.4 
1.1 1.2 1.3 1.4
2.1 2.2 2.3 2.4
3.1 3.2 3.3 3.4

In your original formulation, the thread 0 rA variable will contain (after the loop processing is complete)

0.1 1.1 2.1 3.1

i.e. it will contain column 0 of the A matrix

If we did a refactoring to do vector load of double2 it might look approximately like this:

     double2 *rA2 = reinterpret_cast<double2 *>(&(rA[0]));
     double2 *A2  = reinterpret_cast<double2 *>(&(A[0]));
     int N2 = N/2;
     #pragma unroll
     for (int i = 0; i < N2; i++)
         rA2[i] = A2[gtx + i * lda];

If you went that way, then rA for thread 0 would have:

0.1 0.2 1.1 1.2

i.e. it would not match the previous case. This is because a vector load, by its nature, loads adjacent quantities, because that is how a vector type is stored in memory. It cannot duplicate a columnwise load.

You could fix this either by reorganizing the data originally stored in A or else moving data around between threads after the vector load loop is complete.

This is a lot of work for something that I suspect will lead to very little benefit. Therefore I don’t intend to flesh it out any further.

(the actual values in rA might not be 0.1 0.2 1.1 1.2, but the first two values would be 0.1 0.2 and that is enough to demonstrate the discrepancy. Determining exactly what would be in each thread would depend on a more complete case than what you or I have shown, and as I’ve already indicated, I don’t intend to expend the effort to clarify it any further.)

There are situations where data needs to be loaded row-wise and then converted to column-wise storage across threads. In that case you can find various forum questions that discuss how to do this (transposing data, with some granularity). Here is one. And the trove library also has this topic in view. Again, I would not recommend taking on this additional complexity for anything I have seen so far in this thread.
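To make the shared-memory variant concrete, a staging pass could look roughly like the sketch below. The kernel name, the 32x32 tile size, one-warp blocks, and the 16-byte alignment of A (lda even) are all assumptions made for the illustration:

__global__ void load_columns_via_smem(const double *A, int lda, double *out)
{
    __shared__ double tile[32][33];     // +1 column of padding to reduce bank conflicts

    const int tx = threadIdx.x;         // assumes blockDim.x == 32 (one warp)

    // row-wise, vectorized (double2) loads into shared memory;
    // assumes A is 16-byte aligned and lda is even
    for (int j = 0; j < 32; j += 2) {
        double2 v = *reinterpret_cast<const double2 *>(&A[tx * lda + j]);
        tile[tx][j]     = v.x;
        tile[tx][j + 1] = v.y;
    }
    __syncthreads();

    // column-wise read into registers: thread tx ends up with column tx
    double rA[32];
    for (int i = 0; i < 32; i++)
        rA[i] = tile[i][tx];

    // consume rA so the compiler does not optimize it away (illustration only)
    double s = 0.0;
    for (int i = 0; i < 32; i++)
        s += rA[i];
    out[blockIdx.x * 32 + tx] = s;
}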

You can correct the double2 vector load with shuffle instructions.

bool odd = threadIdx.x & 1;                                  // odd or even lane within a pair
int idx = threadIdx.x / 2 + (odd ? blockDim.x / 2 : 0);      // even threads index the first half, odd threads the second half
double2 read = array[idx];                                   // vectorized 16-byte load
double other = __shfl_xor_sync(0xFFFFFFFF, odd ? read.x : read.y, 1);  // exchange one value with the paired lane
double res1 = odd ? other : read.x;                          // first value for this thread position
double res2 = odd ? read.y : other;                          // second value for this thread position

The even threads read the first n (n = blockDim.x) double values from the array, and the odd threads read the second n double values. The three ternary operators decide which value to exchange, and which value comes first and which second for that thread’s position.

There are simpler ways to do it with more shuffle instructions, but shuffle instructions go through the shared memory unit and are possibly more expensive than simple select (SEL) instructions (the ternary operator), which can run locally in the integer unit of the SM partition.

(The code shown is one way of fleshing out what Robert described as ‘moving data around between threads after the vector load’.)

A simpler alternative could be that each thread processes two consecutive array elements.
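In snippet form, keeping array as a double2 pointer like above, that simpler variant would be just:

int i = blockIdx.x * blockDim.x + threadIdx.x;   // thread i owns elements 2*i and 2*i+1
double2 v = array[i];                            // one 16-byte vectorized load, no shuffle needed
double a = v.x;                                  // element 2*i
double b = v.y;                                  // element 2*i + 1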

For 16-bit data there is also an ldmatrix instruction since Turing (compute capability 7.5), which, when ‘misused’ (PTX ISA :: CUDA Toolkit Documentation), provides up to 16-byte vector loads and distributes the data to the threads of a warp. One has to use the transposed variant (ldmatrix.sync.aligned.x1.trans.m8n8.shared.b16 up to ldmatrix.sync.aligned.x4.trans.m8n8.shared.b16) for this; otherwise it is just like a normal vector load.
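A bare-bones sketch of the .x1 transposed form via inline PTX might look like this (the kernel name and the tile staging are made up for illustration; requires compute capability 7.5+ and a block of one warp, and is only meant to show the mechanics):

#include <cuda_fp16.h>

__global__ void ldmatrix_trans_demo(const __half *gmem, unsigned *out)
{
    __shared__ __half tile[8][8];
    int lane = threadIdx.x & 31;

    // stage the 8x8 tile (64 halfs) into shared memory, row-major
    for (int k = lane; k < 64; k += 32)
        tile[k / 8][k % 8] = gmem[k];
    __syncwarp();

    // lanes 0..7 supply the start address of one 16-byte row each;
    // the addresses of the remaining lanes are not used by the .x1 form
    __half *row_ptr = &tile[lane % 8][0];
    unsigned smem_addr = static_cast<unsigned>(__cvta_generic_to_shared(row_ptr));

    unsigned d;   // receives two b16 elements per thread
    asm volatile("ldmatrix.sync.aligned.x1.trans.m8n8.shared.b16 {%0}, [%1];"
                 : "=r"(d)
                 : "r"(smem_addr));

    out[threadIdx.x] = d;
}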