Error or incomprehension, MMa ptx mixed precision Bfloat16 rtx3080

Hello this is the first time I ask for help on a forum, I hope I will be able to be clear and give you all the necessary information, thank you for your indulgence if you need more information to help me I would provide them quickly of course

I am trying to use the Mma instruction, I was using an rtx3080, I copied the cultass implementation mma sm80 for my use.
Here is my problem I do not really know what Mma instruction is doing I thought this one would do a simple matrix multiplication:

resulta = [ [(a[0] * b[0] + a[0] * b[1]) + d[0] ],
[ (a[1] * b[0] + a[1] * b[1]) + d[1] ],
[ (a[2] * b[0] + a[2] * b[1]) + d[2] ],
[ (a[3] * b[0] + a[3] * b[1]) + d[3] ]]

if i have an error in my implementation surely in the transformation of the float32 into bfloat16 or in the combination of two bfloat16 into float32.
Or otherwise the conversion operations are correct and if someone can explain to me the operations of the operation Mma thank you very much!

Code_Nvidia|580x499

here are the results of my tests

Resulta 1:
a = {1, 1, 1, 1}
b = {1, 1}
c = {0, 0, 0, 0}

D = {8, 8, 8, 8}

Resulta 2:
a = {1, 1, 1, 1}
b = {1, 1}
c = {1, 1, 1, 1}

D = {9, 9, 9, 9}

Resulta 3:
a = {2, 2, 2, 2}
b = {1, 1}
c = {0, 0, 0, 0}

D = {16, 16, 16, 16}

Resulta 4:
a = {2, 2, 2, 2}
b = {1, 1}
c = {0, 0, 0, 0}

D = {16, 16, 16, 16}

Resulta 5:
a = {2, 1, 1, 1}
b = {1, 1}
c = {0, 0, 0, 0}

D = {12, 12, 8, 8}

Resulta 6:
a = {1, 1, 1, 2}
b = {1, 1}
c = {0, 0, 0, 0}

D = {8, 8, 12, 12}

Resulta 6:
a = {1, 1, 1, 2}
b = {1, 1}
c = {0, 0, 0, 0}

D = {8, 8, 12, 12}

Resulta 7:
a = {0.2, 0.2, 0.2, 0.2}
b = {0.2, 0.2}
c = {0, 0, 0, 0}

D = {0.320625, 0.320625, 0.320625, 0.320625}

Resulta 8:
a = {0.2, 0.4, 0.2, 0.2}
b = {0.2, 0.2}
c = {0, 0, 0, 0}

D = {0.480938, 0.480938, 0.320625, 0.320625}

Resulta 8:
a = {0.2, 0.4, 0.2, 0.2}
b = {0.2, 0.2}
c = {2, 4, 8, 1.5}

D = {2.48094, 4.48094, 8.32063, 1.82063}

[The following assumes that the above post was written by a human and not by an AI-bot]

I realize it may be due to auto-completion software of some kind, but your post contains a word wholly unsuitable for this forum. Please fix:

I assume you meant to write “packing” or “conversion” or some such.

Also, please don’t post images of text. Instead, post the text itself. Usually issues are best discussed on the basis of a minimal self-contained code that reproduces the issue and that others can compile and run.

code

#include <cuda.h> 
#include "mma.h"
#include <iostream>
#include <stdio.h>
using namespace std;

__global__ void mma_fp16_acc_fp32(void *out) {
    float c[4] = {0., 0., 0., 0.};
    float d[4] = {0., 0., 0., 0.};
    float a32[4] = {1., 1., 1., 1.};
    float b32[2] = {1., 1.};
    uint16_t a[4];
    uint16_t b[2];
    // convertion float32 -> bfloat16
    asm("cvt.rn.bf16.f32 %0, %1;\n" : "=h"(a[0]) : "f"(a32[0]));
    asm("cvt.rn.bf16.f32 %0, %1;\n" : "=h"(a[1]) : "f"(a32[1]));
    asm("cvt.rn.bf16.f32 %0, %1;\n" : "=h"(a[2]) : "f"(a32[2]));
    asm("cvt.rn.bf16.f32 %0, %1;\n" : "=h"(a[3]) : "f"(a32[3]));
    asm("cvt.rn.bf16.f32 %0, %1;\n" : "=h"(b[0]) : "f"(b32[0]));
    asm("cvt.rn.bf16.f32 %0, %1;\n" : "=h"(b[1]) : "f"(b32[1]));
    // Cast 2*bfloat16 -> uint32_t
    uint32_t const *A = reinterpret_cast<uint32_t const *>(&a);
    uint32_t const *B = reinterpret_cast<uint32_t const *>(&b);
    float const *C = reinterpret_cast<float const *>(&c);
    float *D = reinterpret_cast<float *>(&d);
    asm(
      "mma.sync.aligned.m16n8k8.row.col.f32.bf16.bf16.f32 "
      "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};\n"
      : "=f"(D[0]), "=f"(D[1]), "=f"(D[2]), "=f"(D[3])
      : 
        "r"(A[0]), "r"(A[1]), 
        "r"(B[0]), 
        "f"(C[0]), "f"(C[1]), "f"(C[2]), "f"(C[3])
    );
    memcpy(out, D, 128);
}

int main() {
    float* h_C = (float*)malloc(128);
    float* d_C;
    cudaMalloc(&d_C, 128);
    mma_fp16_acc_fp32<<<1, 32>>>(d_C);
    cudaDeviceSynchronize();
    cudaMemcpy(h_C, d_C, 128, cudaMemcpyDeviceToHost);
    cout << h_C[0];
    cout << h_C[1];
    cout << h_C[2];
    cout << h_C[3];
}

i hope the copier paste some code allows you to the compiler i dont know how to make it more share it more cleanly, thank

I am relaunching the subject to know if someone masters the programming of tensor cores, thank you

That doesn’t look like any matrix multiply I am familiar with. I don’t know how you multiply a 4x1 vector by a 2x1 vector to get a 4x1 result. To be very specific, a matrix multiply should produce a result matrix, where each element is a vector dot product of row or column vectors extracted from the A,B input matrices. Given that definition, under what circumstances would you expect an element of A to appear twice in an individual vector dot-product?

[ [(a[0] * b[0] + a[0] * b[1]) + d[0] ],
    ^^^^          ^^^^

Moving on to your code, the mma op you have chosen has m,n,k values of 16,8,8. Since your first example:

Is loading the same values for each thread, the above a,b values are populating the entire A,B matrices with 1. As a result of this, and due to the fact that your k value is 8, its expected that you will receive an output matrix consisting entirely of 8:

Using a similar logic, at least up through your published result 7, your results look sensible to me, for the input data you are providing. (Based on this, I would also surmise that your bfloat conversions are working correctly.)

1 Like

thank you very much for your answer, my matrix multiplication example is wrong I’m sorry.
i am redoing my code and noticed that one of the problems if i am not wrong, is that I do not load the matrices in the registers

but I have another misunderstanding, this is the size of the matrixes, reading your message i believe i am not providing the correct size

I thought for the instruction m16n8k8, the vector size was:
A[0] = 1 * bf162
A[1] = 1 * bf16
2
B[0] = 1 * bf16*2

D[0] = 1 * f32, so D[1], D[2], D[3], C[0], C[1], C[2], C[3]

if you can tell me the size of these vectors it will help me enormously

Thanks again for your help Robert_Crovella

referring to the docs for the op you have chosen:

Per thread, for 16 bit A/B elements:
For A, you are expected to provide four 16-bit float elements, packed into two 32-bit registers.
For B, you are expected to provide two 16-bit float elements, packed into one 32-bit register.
For C/D, you are expected to provide four elements.

1 Like

thank you very much for your help Robert_Crovella,
if i understand correctly to implement Fully Connected Layer, mma.m16n8k8 do not adapt

sorry to ask you one more question Robert_Crovella, but I don’t understand how to implement a fully connected layer, if the MMA intruction in .bf16, are .m16n8k8, .m16n8k16 and these instructions multiply the same number several times if I understand correctly, while for a fully connected layer, I will need I think having as result

[2, 2, 2, 2] For
A[0] = [1, 1]
A[1] = [1, 1]
B[0] = [1, 1]
C = [0, 0, 0, 0]

thank

By my way of thinking a forward-pass fully connected layer update involves a matrix-vector multiplication, not a matrix-matrix multiplication. (You can convert that matrix-vector multiply to a matrix-matrix multiply using techniques such as batching, or switching to a convolutional layer, etc.)

So, no, I don’t know how to map a single FC layer update to a matrix-matrix multiply unless one of the two matrices is a vector. You’d need a tensor op like m8n1k8 or something like that, and AFAICT no such op exists.

matrix-vector multiply doesn’t have the same arithmetic intensity as matrix-matrix multiply, generally speaking. The tensorcore unit is designed to accelerated matrix-matrix multiply, primarily, not matrix-vector.

You’re welcome to do as you wish, of course, but rather than trying to program these things at the lowest level possible (PTX! yikes) there are libraries like CUBLAS and cuDNN which have many interesting functions and capabilities.

1 Like

Thanks a lot for your help, Robert_Crovella
there is something i dont understand, Mma is doing matrix-matrix multiplication, but taking three 1 * 2 vectors and resulting in a 1 * 4 vector.

So if this is the case, be the vectors 1 * 2 and transform into a matrix compose by several times the same vectors and the output is reduced to a vector? which must not be the case because otherwise we cannot choose the values of our matrices to enter

or else they can pass a matrix to their instruction or they can choose each value and receive a material product in the form of a matrix

Yes, I know it’s framework but I’m looking to design a framework for use cases where cultass is not functional

Do you mean CUTLASS?

I believe cutlass it is the base of cublass for matrixial multiplications and uses the gemms module, cutlass is the basis of tensorflow vs pytorch, although cultass and manific but I think that in specific use gemms and extremely sub optimal, and does not allow to use the power and modularity of rtx and cuda so I have to master the tensor cores.

in fact what can make gemms under efficient is that it uses the gpu as a tpu in most cases it is optimum but in some the modularities offered jointly by the tensors core and the core cuda can be much more efficient than tpu

No, it isn’t. For the op you have chosen (M=16, N=8, K=8) It is taking a 16x8 matrix, an 8x8 matrix, and producing a 16x8 matrix result, considered warp wide. It really doesn’t make sense to try and think about this as the per-thread fragments, because there is no per-thread operation. The actual operation is warp wide. Please re-read the documentation I linked. It points out for the op you have chosen, there is for A a 16x8 matrix and for B an 8x8 matrix. Looking at the per-thread fragments, it makes sense that each thread contributes four elements of A, because when considered warp-wide, that is 4x32 elements, which is equal to 16x8. You can observe similar arithmetic for B,C,D.

The operation is a warp-wide matrix multiply, consisting of a 16x8 A matrix, an 8x8 B matrix, a 16x8 C and a 16x8 D matrix. There is no point that I can think of in trying to look at it as a per-thread op using the fragments. The result fragment D will have results considered warp-wide, and is dependent on fragments provided by other threads. That is why your result fragment D in the first example contains 8’s.

1 Like

thank you so much Robert_Crovella
if I feed the registers with vectors it is because I cannot put them matrices in the registers

hi Robert_Crovella

I try to give matrices of size 16 * 8 or 8 * 16 to the Mma instruction and I always get this error message while I feel like following the documentation I don’t understand my error

error:
ptxas /tmp/tmpxft_00001dec_00000000-6_array.ptx, line 36; error : Unexpected instruction types specified for ‘mma’
ptxas /tmp/tmpxft_00001dec_00000000-6_array.ptx, line 36; error : Arguments mismatch for instruction ‘mma’
ptxas /tmp/tmpxft_00001dec_00000000-6_array.ptx, line 36; error : Arguments mismatch for instruction ‘mma’
ptxas /tmp/tmpxft_00001dec_00000000-6_array.ptx, line 36; error : Unexpected instruction types specified for ‘mma’
ptxas fatal : Ptx assembly aborted due to errors

code :

#include <cuda.h>
#include “mma.h”
#include
#include <stdio.h>
using namespace std;

global void mma_fp16_acc_fp32(void *out) {
float C[16][8] = {0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0};
float a = 0;
float a2 = 0;
float b = 0;
float b2 = 0;
uint32_t A[16][8];
uint32_t B[8][8];
// A
// 1
for (int i = 0; i < 16; i++) {
asm(“cvt.rz.bf16x2.f32 %0, %1, %2;\n” : “=r”(A[i][0]) : “f”(a), “f”(a2));
asm(“cvt.rz.bf16x2.f32 %0, %1, %2;\n” : “=r”(A[i][1]) : “f”(a), “f”(a2));
asm(“cvt.rz.bf16x2.f32 %0, %1, %2;\n” : “=r”(A[i][2]) : “f”(a), “f”(a2));
asm(“cvt.rz.bf16x2.f32 %0, %1, %2;\n” : “=r”(A[i][3]) : “f”(a), “f”(a2));
asm(“cvt.rz.bf16x2.f32 %0, %1, %2;\n” : “=r”(A[i][4]) : “f”(a), “f”(a2));
asm(“cvt.rz.bf16x2.f32 %0, %1, %2;\n” : “=r”(A[i][5]) : “f”(a), “f”(a2));
asm(“cvt.rz.bf16x2.f32 %0, %1, %2;\n” : “=r”(A[i][6]) : “f”(a), “f”(a2));
asm(“cvt.rz.bf16x2.f32 %0, %1, %2;\n” : “=r”(A[i][7]) : “f”(a), “f”(a2));
//asm(“cvt.rz.bf16x2.f32 %0, %1, %2;\n” : “=r”(A[i][8]) : “f”(a), “f”(a2));
//asm(“cvt.rz.bf16x2.f32 %0, %1, %2;\n” : “=r”(A[i][9]) : “f”(a), “f”(a2));
//asm(“cvt.rz.bf16x2.f32 %0, %1, %2;\n” : “=r”(A[i][10]) : “f”(a), “f”(a2));
//asm(“cvt.rz.bf16x2.f32 %0, %1, %2;\n” : “=r”(A[i][11]) : “f”(a), “f”(a2));
//asm(“cvt.rz.bf16x2.f32 %0, %1, %2;\n” : “=r”(A[i][12]) : “f”(a), “f”(a2));
//asm(“cvt.rz.bf16x2.f32 %0, %1, %2;\n” : “=r”(A[i][13]) : “f”(a), “f”(a2));
//asm(“cvt.rz.bf16x2.f32 %0, %1, %2;\n” : “=r”(A[i][14]) : “f”(a), “f”(a2));
//asm(“cvt.rz.bf16x2.f32 %0, %1, %2;\n” : “=r”(A[i][15]) : “f”(a), “f”(a2));
}

// B
// 1
for (int i = 0; i < 8; i++) {
    asm("cvt.rz.bf16x2.f32 %0, %1, %2;\n" : "=r"(B[0][0]) : "f"(b), "f"(b2));
    asm("cvt.rz.bf16x2.f32 %0, %1, %2;\n" : "=r"(B[0][1]) : "f"(b), "f"(b2));
    asm("cvt.rz.bf16x2.f32 %0, %1, %2;\n" : "=r"(B[0][2]) : "f"(b), "f"(b2));
    asm("cvt.rz.bf16x2.f32 %0, %1, %2;\n" : "=r"(B[0][3]) : "f"(b), "f"(b2));
    asm("cvt.rz.bf16x2.f32 %0, %1, %2;\n" : "=r"(B[0][4]) : "f"(b), "f"(b2));
    asm("cvt.rz.bf16x2.f32 %0, %1, %2;\n" : "=r"(B[0][5]) : "f"(b), "f"(b2));
    asm("cvt.rz.bf16x2.f32 %0, %1, %2;\n" : "=r"(B[0][6]) : "f"(b), "f"(b2));
    asm("cvt.rz.bf16x2.f32 %0, %1, %2;\n" : "=r"(B[0][7]) : "f"(b), "f"(b2));
}
asm volatile("{\n\t"
            ".reg .f16x2 %RA<2>, %RB<1>, %RC<4>;\n\t"
            "mma.sync.aligned.m16n8k8.row.col.f32.bf16.bf16.f32\n\t"
            "{%RC0,%RC1,%RC2,%RC3}, {%RA0,%RA1}, {%RB0}, {%RC0,%RC1,%RC2,%RC3};\n\t"
            "}\n\t"
           ::: "memory");
memcpy(out, C, 128);

}

int main() {
float* h_C = (float*)malloc(128);
float* d_C;
cudaMalloc(&d_C, 128);
mma_fp16_acc_fp32<<<1, 32>>>(d_C);
cudaDeviceSynchronize();
cudaMemcpy(h_C, d_C, 128, cudaMemcpyDeviceToHost);

}

j’ai essayer avec une matrixe a 816 and 168

thank

You were already doing that, more-or-less correctly, in the first code you posted. And as we have discussed, that code was giving you the correct/expected answer.

Your latest code suggests to me that you don’t understand some basics of CUDA programming, and also that you haven’t really grasped the description given in the documentation.

The m16n8k8 op expects that each thread in the warp will provide 4 elements of A, two elements of B, and 4 elements of C and 4 elements of D. Your first code was doing this. Now you have posted a code that starts by creating a thread-local array of 16x8 for C (and similarly for A,B). That is not how the op works, and it makes no sense to create a thread-local array of that size, for this op. So at this point it seems fairly clear that you don’t understand how a thread-local array in CUDA works (it is not visible to all threads - each thread has their own local copy of it) and it seems that you have not really internalized the idea that for this 16x8x8 operation, each thread is working collectively with other threads in the warp, and each thread contributes only a portion of the input and output array. The documentation calls these portions “fragments”. Yet you have written code that presumes that a single thread will provide all the elements of the array C (for example - we could make the same statement about A, or B).

If I were to suggest correction, I would simply copy and paste the first code you posted, and one or more of the results it provides. That is the correct way to use this op.

In the future, please use proper code formatting. There are several ways to do this, I shall describe one which strikes me as simplest and easiest to remember, syntactically. When putting code in the entry text-box, select the code, and then click the </> button at the top of the text entry box for entering your post. This will give your posted code an easier-to read setup.

1 Like

As an example demonstrating a “full” M=16, N =8, K=8 matrix multiply, let’s perform the following op:

D = A*B + C

we will set C to all zero. We will choose these values for A,B:

            A                         B                         D
 0  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1    9  9  9  9  9  9  9  9
 1  1  1  1  1  1  1  1     3  3  3  3  3  3  3  3   10 10 10 10 10 10 10 10
 2  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1   11 11 11 11 11 11 11 11
 3  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1   12 12 12 12 12 12 12 12
 4  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1   13 13 13 13 13 13 13 13
 5  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1   14 14 14 14 14 14 14 14
 6  1  1  1  1  1  1  1     1  1  1  1  1  1  1  1   15 15 15 15 15 15 15 15
 7  1  1  1  1  1  1  1 *   1  1  1  1  1  1  1  1 = 16 16 16 16 16 16 16 16
 8  1  1  1  1  1  1  1                              17 17 17 17 17 17 17 17
 9  1  1  1  1  1  1  1                              18 18 18 18 18 18 18 18
10  1  1  1  1  1  1  1                              19 19 19 19 19 19 19 19
11  1  1  1  1  1  1  1                              20 20 20 20 20 20 20 20
12  1  1  1  1  1  1  1                              21 21 21 21 21 21 21 21
13  1  1  1  1  1  1  1                              22 22 22 22 22 22 22 22
14  1  1  1  1  1  1  1                              23 23 23 23 23 23 23 23
15  1  1  1  1  1  1  1                              24 24 24 24 24 24 24 24

Before going any further, I would encourage you to convince yourself that the above linear algebra is correct.

The following then is code that implements that. Note that for simplicity, I have removed the use of bfloat16 and instead am using fp16. This should not matter conceptually for understanding how the op works:

$ cat t10.cu
#include <mma.h>
#include <cuda_fp16.h>
#include <iostream>
#include <stdio.h>

__global__ void mma_fp16_acc_fp32(float *out) {
    float c[4] = {0., 0., 0., 0.};
    float d[4] = {0., 0., 0., 0.};
    half a[4] = {1., 1., 1., 1.};
    half b[2] = {1., 1.};
    // the above would set our input matrices to all 1
    // now lets modify some values
    if (threadIdx.x%4 == 0) {
    // set the first column of A to be 0, 1, 2, 3, ... 15
      a[0] = threadIdx.x/4; a[2] = threadIdx.x/4 + 8;
    // set the second row of B to 3,3,3, ... 3
      b[1] = 3;}
    unsigned const *A = reinterpret_cast<unsigned const *>(&a);
    unsigned const *B = reinterpret_cast<unsigned const *>(&b);
    float const *C = reinterpret_cast<float const *>(&c);
    float *D = reinterpret_cast<float *>(&d);
    asm(
      "mma.sync.aligned.m16n8k8.row.col.f32.f16.f16.f32 "
      "{%0,%1,%2,%3}, {%4,%5}, {%6}, {%7,%8,%9,%10};\n"
      : "=f"(D[0]), "=f"(D[1]), "=f"(D[2]), "=f"(D[3])
      :
        "r"(A[0]), "r"(A[1]),
        "r"(B[0]),
        "f"(C[0]), "f"(C[1]), "f"(C[2]), "f"(C[3])
    );
    memcpy(out+threadIdx.x*2, D, 8);
    memcpy(out+8*8+threadIdx.x*2, D+2, 8);
}

int main() {
    float* h_C = (float*)malloc(16*8*sizeof(float));
    float* d_C;
    cudaMalloc(&d_C, 16*8*sizeof(float));
    mma_fp16_acc_fp32<<<1, 32>>>(d_C);
    cudaDeviceSynchronize();
    cudaMemcpy(h_C, d_C, 16*8*sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 16; i++){
      for (int j = 0; j < 8; j++) std::cout << h_C[i*8+j] << " ";
      std::cout << std::endl;}

}
$ nvcc -o t10 t10.cu -arch=sm_75
$ cuda-memcheck ./t10
========= CUDA-MEMCHECK
9 9 9 9 9 9 9 9
10 10 10 10 10 10 10 10
11 11 11 11 11 11 11 11
12 12 12 12 12 12 12 12
13 13 13 13 13 13 13 13
14 14 14 14 14 14 14 14
15 15 15 15 15 15 15 15
16 16 16 16 16 16 16 16
17 17 17 17 17 17 17 17
18 18 18 18 18 18 18 18
19 19 19 19 19 19 19 19
20 20 20 20 20 20 20 20
21 21 21 21 21 21 21 21
22 22 22 22 22 22 22 22
23 23 23 23 23 23 23 23
24 24 24 24 24 24 24 24
========= ERROR SUMMARY: 0 errors
$

In order to understand the if statement in the kernel code (and likewise for the final memcpy statements), which is selecting specific matrix rows and columns, by selecting particular elements of the fragments distributed across the warp, I encourage you to study the charts indicating fragment organization in the documentation.

I would also point out that this code is just for understanding the behavior of the selected op. I’m not suggesting this code is how you would write a bulk, efficient matrix-matrix multiply routine. For that I would refer you to CUTLASS.

1 Like

thank you very much for your help Robert_Crovella

I really appreciate you helping me out and more than that having made me understand a lot about cuda i have been coding in cuda for 2 weeks and started coding not long ago, i didn’t hadn’t understood that threads are executed natively I thought parallelism occurs when instructions like Mma or for

I would surely not have succeeded in this project which is close to my heart without your help, the performances obtained are much superior to the state of the art thanks to the modularity that cuda allows

my matrix implementation will only occur on a wrap and a single inference, the optimization will mainly occur at the level of the transition from global memory to sherad memory but thanks to the new direct instruction of the ampere architecture it is already in progress optimization and after I will use ldmatrix for the b16 fragments,
I think this way I’m going to have to be close to an efficient model, and my first mistake of always passing the same fragment gave me an optimized solution to go from matrix-matrix multiplication to a vector matrix