For example, applying a bitwise operation to the lower nibble of each fragment element, or taking the square root of each element and assigning it to the same-indexed element of another fragment.
I only found examples like this:
nvcuda::wmma::load_matrix_sync(c_frag, c + indexWarp * 16 * 16, ldc, nvcuda::wmma::mem_col_major);
// all warp threads need to execute this
for (int i = 0; i < c_frag.num_elements; i++)
    c_frag.x[i] += acc_frag.x[i];
nvcuda::wmma::store_matrix_sync(c + indexWarp * 16 * 16, c_frag, ldc, nvcuda::wmma::mem_col_major);
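For the bitwise case mentioned above, this is roughly what I mean; a minimal sketch only, assuming the integer WMMA path (s8 inputs with s32 accumulators, sm_72 or newer) and that c points to an int tile here:

nvcuda::wmma::fragment<nvcuda::wmma::accumulator, 16, 16, 16, int> c_frag;
nvcuda::wmma::fragment<nvcuda::wmma::accumulator, 16, 16, 16, int> d_frag;
nvcuda::wmma::load_matrix_sync(c_frag, c + indexWarp * 16 * 16, ldc, nvcuda::wmma::mem_col_major);
// both fragments have the same type, so the same index refers to the same matrix element
for (int i = 0; i < c_frag.num_elements; i++)
    d_frag.x[i] = c_frag.x[i] & 0x0F; // keep only the lower nibble
nvcuda::wmma::store_matrix_sync(c + indexWarp * 16 * 16, d_frag, ldc, nvcuda::wmma::mem_col_major);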
What if the operation were a square root instead of an addition? Do tensor cores include a dedicated square-root unit?
for (int i = 0; i < c_frag.num_elements; i++)
    c_frag.x[i] = sqrtf(acc_frag.x[i]);
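To make the question concrete, here is a minimal self-contained kernel sketch of that sqrt version (the one-16x16-tile-per-warp layout, the kernel name, and ldc = 16 are my own assumptions carried over from the example above; the sqrtf itself is plain per-thread math, not a WMMA call):

#include <mma.h>
using namespace nvcuda;

// Each warp owns one contiguous 16x16 column-major tile of c.
__global__ void sqrt_tiles(float* c)
{
    const int indexWarp = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    const int ldc = 16;

    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::load_matrix_sync(c_frag, c + indexWarp * 16 * 16, ldc, wmma::mem_col_major);

    // all 32 threads of the warp execute this; each only touches its own slice of the tile
    for (int i = 0; i < c_frag.num_elements; i++)
        c_frag.x[i] = sqrtf(c_frag.x[i]);

    wmma::store_matrix_sync(c + indexWarp * 16 * 16, c_frag, ldc, wmma::mem_col_major);
}

Launched, for instance, as sqrt_tiles<<<numTiles, 32>>>(d_c) with one warp per block, so that indexWarp == blockIdx.x.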
Second question: if I load values into a fragment and the warp then ends, can another warp use those same values directly through a declared fragment, without any load/store? Does CUDA allow reading the leftover (garbage) values from another block/grid/warp inside the same tensor core hardware (assuming two warps in different blocks end up on the same core)? I am only wondering whether this could be used as a fast broadcasting mechanism from the first block to all other blocks.
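To be explicit, the (purely hypothetical) pattern I am asking about would look something like this; I do not expect it to be defined behaviour, it is only meant to illustrate the question:

#include <mma.h>
using namespace nvcuda;

// Hypothetical "broadcast" idea: block 0 loads values into a fragment and exits;
// other blocks declare a fragment and read it WITHOUT loading, hoping the
// hardware still holds block 0's values.
__global__ void fragment_broadcast_question(const float* src, float* dst)
{
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> frag;

    if (blockIdx.x == 0) {
        wmma::load_matrix_sync(frag, src, 16, wmma::mem_col_major);
        // block 0 ends here without storing anything
    } else {
        // is this ever anything other than garbage?
        wmma::store_matrix_sync(dst + blockIdx.x * 16 * 16, frag, 16, wmma::mem_col_major);
    }
}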
Lastly, is there any method to represent a scalar value as a 16x16 matrix and compute its square root through some series of matrix-matrix multiplications (fast inside the tensor core), as a linear-algebraic way of optimizing it on hardware that has no sqrt unit (tying back to the first question)?
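One concrete scheme I had in mind, just to show what I mean by "a series of matrix-matrix multiplications" (this is my own sketch, not anything from the documentation): put the scalar s on the diagonal of a 16x16 matrix, S = s*I, and refine a reciprocal-square-root guess with Newton iterations, which need only multiply-adds and therefore only the D = A*B + C shape; at the end sqrt(s) = s * rsqrt(s). The scalar arithmetic of such a step looks like this:

// Newton iteration for rsqrt using only mul/add, i.e. only D = A*B + C shapes.
// With diagonal matrices Y = y*I and S = s*I, each step maps to a chain of
// mma_sync calls:  T = S*Y,  U = (-0.5*T)*Y + 1.5*I,  Y' = Y*U.
__device__ float rsqrt_newton(float s, float y, int steps)
{
    // y is an initial guess for 1/sqrt(s), e.g. from a small lookup table
    for (int n = 0; n < steps; n++)
        y = y * (1.5f - 0.5f * s * y * y); // multiply-adds only
    return y; // sqrt(s) is then approximately s * y
}

Whether routing this through half-precision a/b fragments would converge well enough, and actually beat a plain sqrtf, is exactly what I am unsure about.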