About compute accuracy

When using mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16, I noticed that every register’s type is reinterpreted as unsigned int. So I wonder whether I can just load half data by reinterpreting it as unsigned int.
Here is the original code clip:

__global__ void f(half* ptr) {
  // load data from ptr into src (B, C, D fragments are set up elsewhere)
  __shared__ half src[...];
  __half2 dst[4];
  for (int i = 0; i < 4; i++) {
    dst[i] = __halves2half2(src[...], src[...]);
  }
  unsigned *A = reinterpret_cast<unsigned *>(&dst);
  __asm__ __volatile__ (
    "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
    "{%0,%1}, {%2,%3,%4,%5}, {%6,%7}, {%8,%9};\n"
    : "=r"(D[0]), "=r"(D[1])
    :
    "r"(A[0]), "r"(A[1]), "r"(A[2]), "r"(A[3]),
    "r"(B[0]), "r"(B[1]),
    "r"(C[0]), "r"(C[1])
  );
}

However, to achieve better loading efficiency, I want a warp to load 128 B instead of 64 B when loading data from ptr (in global memory) to src. So I want to reinterpret both as unsigned, as in the following code:

__global__ void f(half* ptr) {
  // load data from ptr into src (B, C, D fragments are set up elsewhere)
  unsigned *cp_ptr = reinterpret_cast<unsigned *>(ptr);
  __shared__ unsigned src[...];
  unsigned A[4];
  for (int i = 0; i < 4; i++) {
    A[i] = src[...];
  }
  __asm__ __volatile__ (
    "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
    "{%0,%1}, {%2,%3,%4,%5}, {%6,%7}, {%8,%9};\n"
    : "=r"(D[0]), "=r"(D[1])
    :
    "r"(A[0]), "r"(A[1]), "r"(A[2]), "r"(A[3]),
    "r"(B[0]), "r"(B[1]),
    "r"(C[0]), "r"(C[1])
  );
}

I believe there is no alignment problem. But the second clip results in serious calculation accuracy problems while the first one doesn’t. Why? How can I fully use the cache line and keep accuracy at the same time?

Also, if I want to use torch.allclose to check the answer of the fp16 mma, what rtol and atol configuration is recommended?

I also want to know: does loading half data double the loading time compared with loading uint32 data? If so, is it always a better idea to use float4 to load data (if we don’t consider alignment)?

You get a pointer to global memory and then fill A from shared memory.

The unsigned int is actually half2 data reinterpreted.

So some of your questions are about transfer speed and data handling, others about precision.

You could do the matrix multiplication in a slow, manual way on the GPU or CPU with float or double and compare whether the results improve.
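For example, something along these lines as a host-side reference (just a sketch; the function name and the row-major layouts are assumptions, so adjust the indexing to your tiles):

#include <cuda_fp16.h>
#include <cmath>
#include <cstdio>

// Sketch of a double-precision reference check (row-major A, B and D assumed).
void check(const half* A, const half* B, const half* D, int M, int N, int K) {
  double max_abs_err = 0.0;
  for (int m = 0; m < M; ++m)
    for (int n = 0; n < N; ++n) {
      double acc = 0.0;
      for (int k = 0; k < K; ++k)
        acc += (double)__half2float(A[m * K + k]) *
               (double)__half2float(B[k * N + n]);
      double got = (double)__half2float(D[m * N + n]);
      max_abs_err = fmax(max_abs_err, fabs(acc - got));
    }
  printf("max abs error vs. double reference: %g\n", max_abs_err);
}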

Also be aware that besides the input and output types, the mma instruction also selects the data type of the accumulator. That is why four f16 suffixes appear in the instruction name.
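For comparison, a minimal sketch of the same tile shape with a float accumulator (requires sm_80 or newer; how the fragments get filled is left out):

// Sketch: m16n8k16 with fp32 accumulation. C and D are plain floats, so the
// running sum is not rounded back to fp16 after every multiply-accumulate step.
__device__ void mma_m16n8k16_f32acc(const unsigned A[4], const unsigned B[2],
                                    const float C[4], float D[4]) {
  asm volatile(
    "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
    "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
    : "=f"(D[0]), "=f"(D[1]), "=f"(D[2]), "=f"(D[3])
    : "r"(A[0]), "r"(A[1]), "r"(A[2]), "r"(A[3]),
      "r"(B[0]), "r"(B[1]),
      "f"(C[0]), "f"(C[1]), "f"(C[2]), "f"(C[3]));
}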

If you load only 16 bits per thread, the L1 cache will probably prevent the worst. Depending on your application, the execution time stays the same in the best case or doubles in the worst case.
You should aim for memory loads that can be served at full speed, i.e. 32-bit (or 64- or 128-bit) values per thread from L1.

Thanks! I got some new ideas about efficiency. But what about accuracy? Will the way of loading cause an accuracy decrease? Also, what absolute error margin should I choose for fp16 tensor core execution?

Loading itself does not change accuracy.
There are calculation rules for error propagation in numerical computing. When calculating with floating-point numbers, multiplications incur relative errors. For additions it depends on whether the numbers have a similar scale; otherwise the smaller number loses a lot of accuracy.
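A tiny illustration of the addition case (a sketch, assuming a GPU with native fp16 arithmetic): at a magnitude of 2048 the spacing between representable fp16 values is already 2, so an addend of 0.5 vanishes completely.

#include <cuda_fp16.h>
#include <cstdio>

__global__ void absorb_demo() {
  __half big   = __float2half(2048.0f);
  __half small = __float2half(0.5f);
  __half sum   = __hadd(big, small);                        // fp16 addition
  printf("2048 + 0.5 in fp16 = %f\n", __half2float(sum));   // prints 2048.000000
}

int main() {
  absorb_demo<<<1, 1>>>();
  cudaDeviceSynchronize();
  return 0;
}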

Thanks! But I’m still confused about the use of the tensor core. In my experiment, these two code clips show an obvious difference in their results, but I think they should produce the same result.
First clip:

__global__ void f(half* src_A, half* src_B, half* result) {
  __shared__ half sh_B[...];
  __shared__ half sh_result[...];
  __half2 cp_A[4][4]; // used to load src_A
  __half2 cp_B[8]; // used to load src_B
  half cp_result[4]; // used to store the result
  unsigned *A = reinterpret_cast<unsigned *>(&cp_A);
  unsigned *B = reinterpret_cast<unsigned *>(&cp_B);
  unsigned *C = reinterpret_cast<unsigned *>(&cp_result);
  unsigned *D = reinterpret_cast<unsigned *>(&cp_result);

  int base_addr[4][2]; // these addresses can be calculated from threadIdx; assume they are constant
  // load A
  for (int i = 0; i < 4; i++) {
    cp_A[i][0] = __halves2half2(src_A[base_addr[i][0] * 2 + 1], src_A[base_addr[i][0] * 2  + 0]);
    cp_A[i][1] = __halves2half2(src_A[base_addr[i][1] * 2 + 1], src_A[base_addr[i][1] * 2 + 0]);
    cp_A[i][2] = __halves2half2(src_A[base_addr[i][0] * 2 + 9], src_A[base_addr[i][0] * 2 + 8]);
    cp_A[i][3] = __halves2half2(src_A[base_addr[i][1] * 2 + 9], src_A[base_addr[i][1] * 2 + 8]);
  }
  // load B
  for (int i = 0; i < 8; i++) {
    cp_B[i] = __halves2half2(sh_B[...], sh_B[...]);
  }

  // load C
  for (int i = 0; i < 4; i++) {
    cp_result[i] = sh_result[...];
  }

  // calculate
  for (int j = 0; j < 4; j++) {
    __asm__ __volatile__ (
      "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
      "{%0,%1}, {%2,%3,%4,%5}, {%6,%7}, {%8,%9};\n"
      : "=r"(D[0]), "=r"(D[1])
      :
      "r"(A[4 * j + 0]), "r"(A[4 * j + 1]), "r"(A[4 * j + 2]), "r"(A[4 * j + 3]),
      "r"(B[2 * j + 0]), "r"(B[2 * j + 1]),
      "r"(C[0]), "r"(C[1])
    );
  }
}

The second clip only changes the loading type of A to load 128 B per global memory access, but it results in a wrong answer:

__global__ void f(half* src_A, half* src_B, half* result) {
  __shared__ half sh_B[...];
  __shared__ half sh_result[...];
  unsigned * cp_A = reinterpret_cast<unsigned *>(&src_A);
  __half2 cp_B[8]; // used to load src_B
  half cp_result[4]; // used to store the result
  unsigned A[16];
  unsigned *B = reinterpret_cast<unsigned *>(&cp_B);
  unsigned *C = reinterpret_cast<unsigned *>(&cp_result);
  unsigned *D = reinterpret_cast<unsigned *>(&cp_result);

  int base_addr[4][2]; // these addresses can be calculated from threadIdx; assume they are constant
  // load A
  for (int i = 0; i < 4; i++) {
    A[4 * i + 0] = cp_A[base_addr[i][0]];
    A[4 * i + 1] = cp_A[base_addr[i][1]];
    A[4 * i + 2] = cp_A[base_addr[i][0] + 4];
    A[4 * i + 3] = cp_A[base_addr[i][1] + 4];
  }
  // load B
  for (int i = 0; i < 8; i++) {
    cp_B[i] = __halves2half2(sh_B[...], sh_B[...]);
  }

  // load C
  for (int i = 0; i < 4; i++) {
    cp_result[i] = sh_result[...];
  }

  // calculate
  for (int j = 0; j < 4; j++) {
    __asm__ __volatile__ (
      "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
      "{%0,%1}, {%2,%3,%4,%5}, {%6,%7}, {%8,%9};\n"
      : "=r"(D[0]), "=r"(D[1])
      :
      "r"(A[4 * j + 0]), "r"(A[4 * j + 1]), "r"(A[4 * j + 2]), "r"(A[4 * j + 3]),
      "r"(B[2 * j + 0]), "r"(B[2 * j + 1]),
      "r"(C[0]), "r"(C[1])
    );
  }
}

There are 64 * 64 half values in src_A, so I don’t think there is an alignment problem. That’s what confuses me most.

That defines 16 pointers. Leave out the *.

I cannot see 128-bit accesses with single instructions in your code. Have I missed it? Perhaps in your case the compiler creates those as an optimization. You can use 128-bit types or inline PTX to explicitly load 128 bits.
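Roughly like this, for example (a sketch; gp is assumed to be a 16-byte aligned pointer into global memory holding consecutive halves, and how those 8 halves map onto the mma fragment layout is a separate question):

#include <cuda_fp16.h>

// Two equivalent ways to move 8 consecutive halves (16 B) in one instruction.
// The bit patterns of the halves are untouched either way.
__device__ void load_128(const half* gp, unsigned out[4]) {
  // Option 1: a 128-bit type; the compiler emits a single 128-bit load.
  uint4 v = *reinterpret_cast<const uint4*>(gp);
  out[0] = v.x; out[1] = v.y; out[2] = v.z; out[3] = v.w;

  // Option 2: the same load written explicitly in inline PTX.
  asm volatile("ld.global.v4.u32 {%0, %1, %2, %3}, [%4];"
               : "=r"(out[0]), "=r"(out[1]), "=r"(out[2]), "=r"(out[3])
               : "l"(gp));
}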

Sorry, the 128-bit access is in another version based on the second code clip. I simplified that into the second clip.

My actual code has unsigned A[16]; I made a mistake when writing it here. I’ll correct it.

I just don’t understand why the result is wrong when I access src_A as unsigned.

You use src_A in both code clips for indexing. Have you made sure that for the 32-bit accesses the indices are divided by 2 so that they point to the same two elements?

Actually, src_A is converted from a tensor to a half pointer. The tensor shape is (27, 64, 64), and I use contiguous() to ensure the elements are consecutive in memory. In the clip, I access 64 * 64 values in src_A from its beginning. So I think the indices can be divided by 2.

cp_A is 1D, but later on in the loop of variant 1 you access it as 2D.

Can you run the debugger or printf to make sure you get the correct/same data in both variants?
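For example, dropped into variant 2 right after the A loop, something like this prints the two halves packed in one 32-bit word (just a sketch):

// Sketch: unpack one fragment word; the low 16 bits hold the first half, the
// high 16 bits the second. Compare against what __halves2half2 built in variant 1.
unsigned v = A[0];
__half lo = __ushort_as_half((unsigned short)(v & 0xFFFFu));
__half hi = __ushort_as_half((unsigned short)(v >> 16));
if (threadIdx.x == 0 && blockIdx.x == 0)
  printf("A[0]: lo=%f hi=%f\n", __half2float(lo), __half2float(hi));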

They are the same; sorry, I just copied some of my original code. In the first clip, I define cp_A[4][4].

In my original code, my plan is to load data from src_A as unsigned into shared memory, so that a warp can access 128 B instead of the 64 B it gets when loading as half. After that, each thread loads data from shared memory into its registers.
I wonder what the normal way is to load half-precision data from global memory to shared memory. If I just load it as half, will it be 0.5x the speed of loading it as unsigned? Especially when loading it asynchronously?
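For reference, roughly what I have in mind for the staging step (a simplified sketch for sm_80 or newer; the addressing is just a placeholder):

#include <cuda_fp16.h>

// Sketch: each thread issues one asynchronous 16-byte copy (8 halves), so a
// full warp moves 512 B from global to shared memory per iteration.
__global__ void stage_A(const half* __restrict__ src_A) {
  extern __shared__ half smem[];
  int t = threadIdx.x;                       // placeholder addressing
  const half* gsrc = src_A + 8 * t;
  unsigned sdst = static_cast<unsigned>(__cvta_generic_to_shared(smem + 8 * t));
  asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n"
               :: "r"(sdst), "l"(gsrc));
  asm volatile("cp.async.commit_group;\n" ::);
  asm volatile("cp.async.wait_group 0;\n" ::);
  __syncthreads();                           // copied data visible to the whole block
  // ... each thread then reads its mma fragments from smem ...
}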

Let’s look at the first 8 half values.

Variant 1:

    cp_A[i][0] = __halves2half2(src_A[base_addr[i][0] * 2 + 1], src_A[base_addr[i][0] * 2  + 0]);
    cp_A[i][1] = __halves2half2(src_A[base_addr[i][1] * 2 + 1], src_A[base_addr[i][1] * 2 + 0]);
    cp_A[i][2] = __halves2half2(src_A[base_addr[i][0] * 2 + 9], src_A[base_addr[i][0] * 2 + 8]);
    cp_A[i][3] = __halves2half2(src_A[base_addr[i][1] * 2 + 9], src_A[base_addr[i][1] * 2 + 8]);
  }

Variant 2:

    A[4 * i + 0] = cp_A[src_A[base_addr[i][0]]];
    A[4 * i + 1] = cp_A[src_A[base_addr[i][1]]];
    A[4 * i + 2] = cp_A[src_A[base_addr[i][0] + 4]];
    A[4 * i + 3] = cp_A[src_A[base_addr[i][1] + 4]];
  }

cp_A already points to src_A. There is a double indirection. Use just one in variant 2.

You are right, I will correct it.

The answer is still wrong after correcting this. I think there might be another problem.