When working on elements of fragments directly, is it computed inside tensor core or CUDA core?

For example, taking bitwise operation of lower-nibble of each fragment element or taking square root of each element & assigning to another fragment’s same indexed element.

I only found examples like this:

    nvcuda::wmma::load_matrix_sync(c_frag, c + indexWarp * 16 * 16, ldc, nvcuda::wmma::mem_col_major);

    // all warp threads need to execute this
    for (int i = 0; i < c_frag.num_elements; i++)
        c_frag.x[i] += acc_frag.x[i];

    nvcuda::wmma::store_matrix_sync(c + indexWarp * 16 * 16, c_frag, ldc, nvcuda::wmma::mem_col_major);

What if the addition operation was square root instead? Does tensor core include a dedicated square-root unit inside?

    for (int i = 0; i < c_frag.num_elements; i++)
        c_frag.x[i] = sqrt(acc_frag.x[i]);

Second question: if I “load” values into a fragment and if warp ends, can another warp load the same values without any load/store but directly using a defined fragment? Does CUDA allow usage of garbage values left from another block/grid/warp inside the same tensor core hardware(assuming same core was found by two different warps in different blocks). Only wondering if this can be used as a fast broadcasting mechanism from first block to all other blocks.


Lastly, is there any method to represent a scalar value as a 16x16 matrix and to define its square-root by some series of matrix-matrix multiplications (but fast inside tensor core) as a linear-algebraic way of optimizing it for hardwares that have no sqrt unit (from first question)?

the only thing that executes on the tensor core is the matrix-multiply op
load/store and fragment ops do not use TC (matrix-multiply) hardware. TC hardware has no sqrt capability. If you so a sqrt as a fragment op, and study the resultant SASS code, you can see that it is performed in a fashion similar to any other sqrt issued.

A fragment is loaded (by load_matrix_sync) into a register footprint that spans each thread in a warp. registers are local to threads; they are not directly accessible from other threads, whether those threads are in the same warp or different warps.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.