About compute accuracy

When using mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16, I noticed that every register’s type is reinterpreted as unsigned int. So I wonder whether I can just load half data by reinterpreting it as unsigned int.
Here is the original code clip:

__global__ void f(half* ptr) {
  // load data from ptr into src (B, C, D fragments are set up elsewhere)
  __shared__ half src[...];
  __half2 dst[4];
  for (int i = 0; i < 4; i++) {
    dst[i] = __halves2half2(src[...], src[...]);
  }
  unsigned *A = reinterpret_cast<unsigned *>(&dst);
  __asm__ __volatile__ (
    "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
    "{%0,%1}, {%2,%3,%4,%5}, {%6,%7}, {%8,%9};\n"
    : "=r"(D[0]), "=r"(D[1])
    :
    "r"(A[0]), "r"(A[1]), "r"(A[2]), "r"(A[3]),
    "r"(B[0]), "r"(B[1]),
    "r"(C[0]), "r"(C[1])
  );
}

However, to achieve better loading efficiency, I want a warp to load 128 B instead of 64 B when loading data from ptr (in global memory) to src. So I want to reinterpret both as unsigned, as in the following code:

__global__ void f(half* ptr) {
  // load data from ptr into src (B, C, D fragments are set up elsewhere)
  unsigned *cp_ptr = reinterpret_cast<unsigned *>(ptr);
  __shared__ unsigned src[...];
  unsigned A[4];
  for (int i = 0; i < 4; i++) {
    A[i] = src[...];
  }
  __asm__ __volatile__ (
    "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
    "{%0,%1}, {%2,%3,%4,%5}, {%6,%7}, {%8,%9};\n"
    : "=r"(D[0]), "=r"(D[1])
    :
    "r"(A[0]), "r"(A[1]), "r"(A[2]), "r"(A[3]),
    "r"(B[0]), "r"(B[1]),
    "r"(C[0]), "r"(C[1])
  );
}

I believe there is no alignment problem. But the second clip results in serious calculation accuracy problems while the first one doesn’t. Why? How can I fully use the cache line and keep accuracy at the same time?

Also, if I want to use torch.allclose to check the answer of the fp16 mma, what rtol and atol configuration is recommended?

I also want to know: does loading half data double the loading time compared with loading uint32 data? If so, is it always a better idea to use float4 to load data (if we don’t consider alignment)?

You get a pointer to global memory and then fill A from shared memory.

The unsigned int is actually half2 data reinterpreted.

So some of your questions are about transfer speed and data handling, others about precision.

You could do the matrix multiplication in a slow, manual way on the GPU or CPU with float or double and compare whether the results improve.
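For example, something along these lines as a host-side reference (just a sketch; the function name and the row-major layouts are assumptions, so adjust the indexing to your tiles):

#include <cuda_fp16.h>
#include <cmath>
#include <cstdio>

// Sketch of a double-precision reference check (row-major A, B and D assumed).
void check(const half* A, const half* B, const half* D, int M, int N, int K) {
  double max_abs_err = 0.0;
  for (int m = 0; m < M; ++m)
    for (int n = 0; n < N; ++n) {
      double acc = 0.0;
      for (int k = 0; k < K; ++k)
        acc += (double)__half2float(A[m * K + k]) *
               (double)__half2float(B[k * N + n]);
      double got = (double)__half2float(D[m * N + n]);
      max_abs_err = fmax(max_abs_err, fabs(acc - got));
    }
  printf("max abs error vs. double reference: %g\n", max_abs_err);
}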

Also be aware that besides the input and output types, the mma instruction also selects the data type of the accumulator. That is why four f16 suffixes appear in the instruction name.
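For comparison, a minimal sketch of the same tile shape with a float accumulator (requires sm_80 or newer; how the fragments get filled is left out):

// Sketch: m16n8k16 with fp32 accumulation. C and D are plain floats, so the
// running sum is not rounded back to fp16 after every multiply-accumulate step.
__device__ void mma_m16n8k16_f32acc(const unsigned A[4], const unsigned B[2],
                                    const float C[4], float D[4]) {
  asm volatile(
    "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
    "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
    : "=f"(D[0]), "=f"(D[1]), "=f"(D[2]), "=f"(D[3])
    : "r"(A[0]), "r"(A[1]), "r"(A[2]), "r"(A[3]),
      "r"(B[0]), "r"(B[1]),
      "f"(C[0]), "f"(C[1]), "f"(C[2]), "f"(C[3]));
}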

If you load only 16 bits per thread, the L1 cache will probably prevent the worst. Depending on your application, the execution time stays the same in the best case or doubles in the worst case.
You should aim for memory loads that can be served at full speed, i.e. 32-bit (or 64- or 128-bit) values per thread from L1.

Thanks! I got some new ideas about efficiency. But what about accuracy? Will the way of loading cause an accuracy decrease? Also, what absolute error margin should I choose for fp16 tensor core execution?

Loading itself does not change accuracy.
There are calculation rules for error propagation in numerical computing. When calculating with floating-point numbers, multiplications incur relative errors. For additions it depends on whether the numbers have a similar scale; otherwise the smaller number loses a lot of accuracy.
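A tiny illustration of the addition case (a sketch, assuming a GPU with native fp16 arithmetic): at a magnitude of 2048 the spacing between representable fp16 values is already 2, so an addend of 0.5 vanishes completely.

#include <cuda_fp16.h>
#include <cstdio>

__global__ void absorb_demo() {
  __half big   = __float2half(2048.0f);
  __half small = __float2half(0.5f);
  __half sum   = __hadd(big, small);                        // fp16 addition
  printf("2048 + 0.5 in fp16 = %f\n", __half2float(sum));   // prints 2048.000000
}

int main() {
  absorb_demo<<<1, 1>>>();
  cudaDeviceSynchronize();
  return 0;
}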

Thanks! But I’m still confused about the use of the tensor core. In my experiment, these two code clips show an obvious difference in their results, but I think they should produce the same result.
First clip:

__global__ void f(half* src_A, half* src_B, half* result) {
  __shared__ half sh_B[...];
  __shared__ half sh_result[...];
  __half2 cp_A[4][4]; // used to load src_A
  __half2 cp_B[8]; // used to load src_B
  half cp_result[4]; // used to store the result
  unsigned *A = reinterpret_cast<unsigned *>(&cp_A);
  unsigned *B = reinterpret_cast<unsigned *>(&cp_B);
  unsigned *C = reinterpret_cast<unsigned *>(&cp_result);
  unsigned *D = reinterpret_cast<unsigned *>(&cp_result);

  int base_addr[4][2]; // these addresses can be calculated from threadIdx; assume they are constant
  // load A
  for (int i = 0; i < 4; i++) {
    cp_A[i][0] = __halves2half2(src_A[base_addr[i][0] * 2 + 1], src_A[base_addr[i][0] * 2  + 0]);
    cp_A[i][1] = __halves2half2(src_A[base_addr[i][1] * 2 + 1], src_A[base_addr[i][1] * 2 + 0]);
    cp_A[i][2] = __halves2half2(src_A[base_addr[i][0] * 2 + 9], src_A[base_addr[i][0] * 2 + 8]);
    cp_A[i][3] = __halves2half2(src_A[base_addr[i][1] * 2 + 9], src_A[base_addr[i][1] * 2 + 8]);
  }
  // load B
  for (int i = 0; i < 8; i++) {
    cp_B[i] = __halves2half2(sh_B[...], sh_B[...]);
  }

  // load C
  for (int i = 0; i < 4; i++) {
    cp_result[i] = sh_result[...];
  }

  // calculate
  for (int j = 0; j < 4; j++) {
    __asm__ __volatile__ (
      "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
      "{%0,%1}, {%2,%3,%4,%5}, {%6,%7}, {%8,%9};\n"
      : "=r"(D[0]), "=r"(D[1])
      :
      "r"(A[4 * j + 0]), "r"(A[4 * j + 1]), "r"(A[4 * j + 2]), "r"(A[4 * j + 3]),
      "r"(B[2 * j + 0]), "r"(B[2 * j + 1]),
      "r"(C[0]), "r"(C[1])
    );
  }
}

The second clip only changes the loading type of A to load 128 B per global memory access, but it results in a wrong answer:

__global__ void f(half* src_A, half* src_B, half* result) {
  __shared__ half sh_B[...];
  __shared__ half sh_result[...];
  unsigned * cp_A = reinterpret_cast<unsigned *>(&src_A);
  __half2 cp_B[8]; // used to load src_B
  half cp_result[4]; // used to store the result
  unsigned A[16];
  unsigned *B = reinterpret_cast<unsigned *>(&cp_B);
  unsigned *C = reinterpret_cast<unsigned *>(&cp_result);
  unsigned *D = reinterpret_cast<unsigned *>(&cp_result);

  int base_addr[4][2]; // these addresses can be calculated from threadIdx; assume they are constant
  // load A
  for (int i = 0; i < 4; i++) {
    A[4 * i + 0] = cp_A[base_addr[i][0]];
    A[4 * i + 1] = cp_A[base_addr[i][1]];
    A[4 * i + 2] = cp_A[base_addr[i][0] + 4];
    A[4 * i + 3] = cp_A[base_addr[i][1] + 4];
  }
  // load B
  for (int i = 0; i < 8; i++) {
    cp_B[i] = __halves2half2(sh_B[...], sh_B[...]);
  }

  // load C
  for (int i = 0; i < 4; i++) {
    cp_result[i] = sh_result[...];
  }

  // calculate
  for (int j = 0; j < 4; j++) {
    __asm__ __volatile__ (
      "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
      "{%0,%1}, {%2,%3,%4,%5}, {%6,%7}, {%8,%9};\n"
      : "=r"(D[0]), "=r"(D[1])
      :
      "r"(A[4 * j + 0]), "r"(A[4 * j + 1]), "r"(A[4 * j + 2]), "r"(A[4 * j + 3]),
      "r"(B[2 * j + 0]), "r"(B[2 * j + 1]),
      "r"(C[0]), "r"(C[1])
    );
  }
}

There are 64 * 64 half values in src_A, so I don’t think there is an alignment problem. That’s what confuses me most.

That defines 16 pointers. Leave out the *.

I cannot see 128-bit accesses with single instructions in your code. Have I missed it? Perhaps in your case the compiler creates those as an optimization. You can use 128-bit types or inline PTX to explicitly load 128 bits.
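Roughly like this, for example (a sketch; gp is assumed to be a 16-byte aligned pointer into global memory holding consecutive halves, and how those 8 halves map onto the mma fragment layout is a separate question):

#include <cuda_fp16.h>

// Two equivalent ways to move 8 consecutive halves (16 B) in one instruction.
// The bit patterns of the halves are untouched either way.
__device__ void load_128(const half* gp, unsigned out[4]) {
  // Option 1: a 128-bit type; the compiler emits a single 128-bit load.
  uint4 v = *reinterpret_cast<const uint4*>(gp);
  out[0] = v.x; out[1] = v.y; out[2] = v.z; out[3] = v.w;

  // Option 2: the same load written explicitly in inline PTX.
  asm volatile("ld.global.v4.u32 {%0, %1, %2, %3}, [%4];"
               : "=r"(out[0]), "=r"(out[1]), "=r"(out[2]), "=r"(out[3])
               : "l"(gp));
}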

Sorry, the 128-bit access is in another version based on the second code clip. I simplified that into the second clip.

My actual code has unsigned A[16]; I made a mistake when writing it here. I’ll correct it.

I just don’t understand why the result is wrong when I access src_A as unsigned.

You use src_A in both code clips for indexing. Have you made sure that for the 32-bit accesses the indices are divided by 2 so that they point to the same two elements?

Actually, src_A is converted from a tensor to a half pointer. The tensor shape is (27, 64, 64), and I use contiguous() to ensure the elements are consecutive in memory. In the clip, I access 64 * 64 values in src_A from its beginning. So I think the indices can be divided by 2.

cp_A is 1D, but later on in the loop of variant 1 you access it as 2D.

Can you run the debugger or printf to make sure you get the correct/same data in both variants?
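For example, dropped into variant 2 right after the A loop, something like this prints the two halves packed in one 32-bit word (just a sketch):

// Sketch: unpack one fragment word; the low 16 bits hold the first half, the
// high 16 bits the second. Compare against what __halves2half2 built in variant 1.
unsigned v = A[0];
__half lo = __ushort_as_half((unsigned short)(v & 0xFFFFu));
__half hi = __ushort_as_half((unsigned short)(v >> 16));
if (threadIdx.x == 0 && blockIdx.x == 0)
  printf("A[0]: lo=%f hi=%f\n", __half2float(lo), __half2float(hi));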

They are the same; sorry, I just copied some of my original code. In the first clip, I define cp_A[4][4].

In my original code, my plan is to load data from src_A as unsigned into shared memory, so that a warp can access 128 B instead of the 64 B it gets when loading as half. After that, each thread loads data from shared memory into its registers.
I wonder what the normal way is to load half-precision data from global memory to shared memory. If I just load it as half, will it be 0.5x the speed of loading it as unsigned? Especially when loading it asynchronously?
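For reference, roughly what I have in mind for the staging step (a simplified sketch for sm_80 or newer; the addressing is just a placeholder):

#include <cuda_fp16.h>

// Sketch: each thread issues one asynchronous 16-byte copy (8 halves), so a
// full warp moves 512 B from global to shared memory per iteration.
__global__ void stage_A(const half* __restrict__ src_A) {
  extern __shared__ half smem[];
  int t = threadIdx.x;                       // placeholder addressing
  const half* gsrc = src_A + 8 * t;
  unsigned sdst = static_cast<unsigned>(__cvta_generic_to_shared(smem + 8 * t));
  asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n"
               :: "r"(sdst), "l"(gsrc));
  asm volatile("cp.async.commit_group;\n" ::);
  asm volatile("cp.async.wait_group 0;\n" ::);
  __syncthreads();                           // copied data visible to the whole block
  // ... each thread then reads its mma fragments from smem ...
}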

Let’s look at the first 8 half values.

Variant 1:

    cp_A[i][0] = __halves2half2(src_A[base_addr[i][0] * 2 + 1], src_A[base_addr[i][0] * 2  + 0]);
    cp_A[i][1] = __halves2half2(src_A[base_addr[i][1] * 2 + 1], src_A[base_addr[i][1] * 2 + 0]);
    cp_A[i][2] = __halves2half2(src_A[base_addr[i][0] * 2 + 9], src_A[base_addr[i][0] * 2 + 8]);
    cp_A[i][3] = __halves2half2(src_A[base_addr[i][1] * 2 + 9], src_A[base_addr[i][1] * 2 + 8]);
  }

Variant 2:

    A[4 * i + 0] = cp_A[src_A[base_addr[i][0]]];
    A[4 * i + 1] = cp_A[src_A[base_addr[i][1]]];
    A[4 * i + 2] = cp_A[src_A[base_addr[i][0] + 4]];
    A[4 * i + 3] = cp_A[src_A[base_addr[i][1] + 4]];
  }

cp_A already points to src_A. There is a double indirection. Use just one in variant 2.

You are right, I will correct it.

The answer is still wrong after correcting this. I think there might be another problem.