How does an operation like "some_fragment.x[index]" work in the WMMA API?

I have to use the WMMA API to do some matrix multiplication, and I want to access each individual element in the fragment to perform some operation, perhaps as simple as multiplying by a variable, or something more complex.

Below is an example. At first, I thought each thread should access different indices according to its threadIdx, e.g. thread 0 operates on x[0] to x[7], thread 1 on x[8] to x[15], but that doesn't work.

After some experimenting, I found that each thread seems to have a different view of the fragment: for thread 0, x[0] maps to the first element of the matrix, x[1] to the second, x[2] to the 8th, x[3] to the 9th, and so on. So how does this mapping work? I couldn't find it described in any of the docs (or maybe I missed it).

#include <cuda_runtime.h>
#include <mma.h>
#include <iostream>
using namespace nvcuda;

__global__ void wmma_ker(half *a, half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    // here is the operation: loop over the elements this thread holds
    // (num_elements, not 16 -- each thread holds only a subset of the tile)
    for (int i = 0; i < c_frag.num_elements; i++){
        c_frag.x[i] = c_frag.x[i] * threadIdx.x;
    }
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}

void print_data(half *data){
    for (int i = 0; i < 16; i++){
        for (int j = 0; j < 16; j++){
            std::cout << __half2float(data[16*i+j]) << ",";
        }
        std::cout << std::endl;
    }
    std::cout << std::endl << std::endl;
}

int main(){
    half* data_a_host = (half*)malloc(sizeof(half) * 16 * 16);
    half* data_b_host = (half*)malloc(sizeof(half) * 16 * 16);
    float* data_c_host = (float*)malloc(sizeof(float) * 16 * 16);
    for (int i = 0; i < 256; i++){
        data_a_host[i] = half(i);
        data_b_host[i] = half(i);
    }
    half *data_a, *data_b;
    float* data_c;
    cudaMalloc(&data_a, sizeof(half) * 16 * 16);
    cudaMalloc(&data_b, sizeof(half) * 16 * 16);
    cudaMalloc(&data_c, sizeof(float) * 16 * 16);
    cudaMemcpy(data_a, data_a_host, sizeof(half) * 256, cudaMemcpyHostToDevice);
    cudaMemcpy(data_b, data_b_host, sizeof(half) * 256, cudaMemcpyHostToDevice);
    wmma_ker<<<1, 32>>>(data_a, data_b, data_c);
    cudaMemcpy(data_c_host, data_c, sizeof(float) * 256, cudaMemcpyDeviceToHost);
    for (int i = 0; i < 16; i++){
        for (int j = 0; j < 16; j++){
            std::cout << data_c_host[16*i+j] << ",";
        }
        std::cout << std::endl;
    }
    return 0;
}

The mapping is intentionally not documented: it is unspecified and should not be relied upon even if discovered experimentally.

The only supported use of individual access to fragment elements is when the operation to be done is uniform across all threads in the warp and all fragment elements. From the documentation:

Because the map of matrix elements into each thread’s fragment is unspecified

In the special case where all threads in the warp will apply an element-wise operation uniformly to all fragment elements, direct element access can be implemented using the following fragment class members.
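For illustration, that uniform pattern looks roughly like this (a sketch, assuming a c_frag accumulator fragment as in the code above; the key requirement is that scale has the same value in every thread of the warp):

    // Uniform element-wise scaling: every thread applies the same operation
    // to every element it holds, so the unspecified mapping does not matter.
    float scale = 2.0f;  // must be uniform across the warp
    for (int i = 0; i < c_frag.num_elements; i++) {
        c_frag.x[i] = c_frag.x[i] * scale;
    }

Multiplying by threadIdx.x, as in the kernel above, violates this requirement, which is why the result depends on the unspecified mapping.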

For people looking for direct control of the matrix-multiply operands, I usually suggest using PTX mma instructions instead, since those have a documented per-thread register layout. Here is an example.
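As a rough sketch of what the PTX route looks like (my own illustration, not the linked example; this assumes sm_80 or later and the m16n8k16 shape from the PTX ISA, where A occupies four 32-bit registers of packed halves, B two, and C/D four f32 registers per thread):

    // One warp computes D = A*B + C for a 16x8 output tile (m16n8k16).
    // a[4] and b[2] hold packed pairs of half values; c[4], d[4] are f32.
    // The exact element-to-thread mapping is documented in the PTX ISA.
    __device__ void mma_16x8x16(float d[4], const unsigned a[4],
                                const unsigned b[2], const float c[4]) {
        asm volatile(
            "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
            "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
            : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
            : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
              "r"(b[0]), "r"(b[1]),
              "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
    }

Because the register layout is specified, you know exactly which matrix elements each thread holds and can manipulate them individually.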

Thanks for your reply. Actually, I only need to do matrix multiplication and element-wise multiplication, so I think WMMA is enough for me. Thanks to your reply, I now know how to handle the element-wise multiplication.

But I still have another question. If I want to copy one fragment to another, say copy C to D, will an operation like the one below be faster than storing C to shared memory and then loading it into D?

for(int i = 0; i < C.num_elements; i++) D.x[i] = C.x[i];

I have no idea. If it were me, and I were using wmma operations, I would use the load and store functionality to copy one fragment to another.
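For reference, that store/load round trip would look roughly like this (a sketch; shmem is a hypothetical 16x16 float buffer in shared memory, and c_frag and d_frag are both 16x16x16 float accumulator fragments):

    __shared__ float shmem[16 * 16];
    // Store C: the unspecified per-thread mapping is resolved to row-major memory...
    wmma::store_matrix_sync(shmem, c_frag, 16, wmma::mem_row_major);
    __syncwarp();
    // ...then load the same tile back into D (accumulator loads need a layout).
    wmma::load_matrix_sync(d_frag, shmem, 16, wmma::mem_row_major);

That said, since C and D here have the same fragment type, the element-wise copy loop is a uniform per-element operation across the warp, so it falls under the documented special case as well.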

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.