For example, applying a bitwise operation to the lower nibble of each fragment element, or taking the square root of each element and assigning it to the same-indexed element of another fragment.
I only found examples like this:
nvcuda::wmma::load_matrix_sync(c_frag, c + indexWarp * 16 * 16, ldc, nvcuda::wmma::mem_col_major);
// all warp threads need to execute this
for (int i = 0; i < c_frag.num_elements; i++)
    c_frag.x[i] += acc_frag.x[i];
nvcuda::wmma::store_matrix_sync(c + indexWarp * 16 * 16, c_frag, ldc, nvcuda::wmma::mem_col_major);
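For the bitwise case mentioned above, this is roughly what I mean; a minimal sketch only, assuming the integer WMMA path (s8 inputs with s32 accumulators, sm_72 or newer) and that c points to an int tile here:

nvcuda::wmma::fragment<nvcuda::wmma::accumulator, 16, 16, 16, int> c_frag;
nvcuda::wmma::fragment<nvcuda::wmma::accumulator, 16, 16, 16, int> d_frag;
nvcuda::wmma::load_matrix_sync(c_frag, c + indexWarp * 16 * 16, ldc, nvcuda::wmma::mem_col_major);
// both fragments have the same type, so the same index refers to the same matrix element
for (int i = 0; i < c_frag.num_elements; i++)
    d_frag.x[i] = c_frag.x[i] & 0x0F; // keep only the lower nibble
nvcuda::wmma::store_matrix_sync(c + indexWarp * 16 * 16, d_frag, ldc, nvcuda::wmma::mem_col_major);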
What if the operation were a square root instead of an addition? Do tensor cores include a dedicated square-root unit?
for (int i = 0; i < c_frag.num_elements; i++)
    c_frag.x[i] = sqrtf(acc_frag.x[i]);
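To make the question concrete, here is a minimal self-contained kernel sketch of that sqrt version (the one-16x16-tile-per-warp layout, the kernel name, and ldc = 16 are my own assumptions carried over from the example above; the sqrtf itself is plain per-thread math, not a WMMA call):

#include <mma.h>
using namespace nvcuda;

// Each warp owns one contiguous 16x16 column-major tile of c.
__global__ void sqrt_tiles(float* c)
{
    const int indexWarp = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    const int ldc = 16;

    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::load_matrix_sync(c_frag, c + indexWarp * 16 * 16, ldc, wmma::mem_col_major);

    // all 32 threads of the warp execute this; each only touches its own slice of the tile
    for (int i = 0; i < c_frag.num_elements; i++)
        c_frag.x[i] = sqrtf(c_frag.x[i]);

    wmma::store_matrix_sync(c + indexWarp * 16 * 16, c_frag, ldc, wmma::mem_col_major);
}

Launched, for instance, as sqrt_tiles<<<numTiles, 32>>>(d_c) with one warp per block, so that indexWarp == blockIdx.x.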
Second question: if I load values into a fragment and the warp then ends, can another warp use those same values directly through a declared fragment, without any load/store? Does CUDA allow reading the leftover (garbage) values from another block/grid/warp inside the same tensor core hardware (assuming two warps in different blocks end up on the same core)? I am only wondering whether this could be used as a fast broadcasting mechanism from the first block to all other blocks.
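To be explicit, the (purely hypothetical) pattern I am asking about would look something like this; I do not expect it to be defined behaviour, it is only meant to illustrate the question:

#include <mma.h>
using namespace nvcuda;

// Hypothetical "broadcast" idea: block 0 loads values into a fragment and exits;
// other blocks declare a fragment and read it WITHOUT loading, hoping the
// hardware still holds block 0's values.
__global__ void fragment_broadcast_question(const float* src, float* dst)
{
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> frag;

    if (blockIdx.x == 0) {
        wmma::load_matrix_sync(frag, src, 16, wmma::mem_col_major);
        // block 0 ends here without storing anything
    } else {
        // is this ever anything other than garbage?
        wmma::store_matrix_sync(dst + blockIdx.x * 16 * 16, frag, 16, wmma::mem_col_major);
    }
}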
Lastly, is there any method to represent a scalar value as a 16x16 matrix and compute its square root through some series of matrix-matrix multiplications (fast inside the tensor core), as a linear-algebraic way of optimizing it on hardware that has no sqrt unit (tying back to the first question)?
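One concrete scheme I had in mind, just to show what I mean by "a series of matrix-matrix multiplications" (this is my own sketch, not anything from the documentation): put the scalar s on the diagonal of a 16x16 matrix, S = s*I, and refine a reciprocal-square-root guess with Newton iterations, which need only multiply-adds and therefore only the D = A*B + C shape; at the end sqrt(s) = s * rsqrt(s). The scalar arithmetic of such a step looks like this:

// Newton iteration for rsqrt using only mul/add, i.e. only D = A*B + C shapes.
// With diagonal matrices Y = y*I and S = s*I, each step maps to a chain of
// mma_sync calls:  T = S*Y,  U = (-0.5*T)*Y + 1.5*I,  Y' = Y*U.
__device__ float rsqrt_newton(float s, float y, int steps)
{
    // y is an initial guess for 1/sqrt(s), e.g. from a small lookup table
    for (int n = 0; n < steps; n++)
        y = y * (1.5f - 0.5f * s * y * y); // multiply-adds only
    return y; // sqrt(s) is then approximately s * y
}

Whether routing this through half-precision a/b fragments would converge well enough, and actually beat a plain sqrtf, is exactly what I am unsure about.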