Ti refers to the thread. A tensor core op is a warp-wide operation; each thread in the warp holds input (and output) for the op. The metadata is likewise contained in one register per thread in the warp. The table you have excerpted shows, for each thread, which area its metadata applies to.
In figure 83, we see that a sparse matrix suitable for this kind of sparse matrix-matrix multiply has a particular sparsity pattern. You cannot have an arbitrary sparsity pattern. Instead, considering each square set or “chunk” of 4 elements, exactly 2 of those 4 elements are allowed to be significant, and the other two must be zero.
This metadata selects which quadrants of the square chunk have non-zero data. An example of the relationship between chunk arrangement and metadata is given in figure 84.
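To make the chunk/metadata relationship concrete, here is a small host-side sketch. The helper function is hypothetical (not an NVIDIA API), and the low-to-high packing order of the 2-bit indices is an assumption for illustration; the authoritative layout is the one given in the PTX doc and figure 84:

```cuda
#include <cstdint>
#include <cstdio>

// Hypothetical helper: compress one 4-element chunk under 2:4 sparsity.
// Exactly two of the four inputs are expected to be non-zero. The two
// kept values go to out[0..1], and their positions are packed as two
// 2-bit indices into the returned metadata nibble (assumed low-to-high).
uint8_t compress_chunk_2of4(const float in[4], float out[2]) {
    uint8_t meta = 0;
    int kept = 0;
    for (int pos = 0; pos < 4 && kept < 2; ++pos) {
        if (in[pos] != 0.0f) {
            out[kept] = in[pos];
            meta |= (uint8_t)(pos << (2 * kept)); // 2-bit position index per kept value
            ++kept;
        }
    }
    return meta;
}

int main() {
    float chunk[4] = {0.0f, 1.5f, 0.0f, -2.0f}; // non-zeros at positions 1 and 3
    float vals[2];
    uint8_t meta = compress_chunk_2of4(chunk, vals);
    printf("kept %.1f %.1f, metadata 0x%x\n", vals[0], vals[1], meta); // 0xd = 0b11'01
    return 0;
}
```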
I see… But why is Ti used here in particular? Elsewhere the diagrams use T0, T1…
Also, I see a T2i. What does this mean?
I have the same question. @Robert_Crovella, could you please explain it in detail? What is the meaning of “i” here?
Referring to section 9.7.15.5.1, “Sparse Matrix Storage”, in the current (PTX 8.5 / CUDA 12.6) doc, we see the following:
In a group of four consecutive threads, one or more threads store the metadata for the whole group depending upon the matrix shape. These threads are specified using an additional sparsity selector operand.
And we note that the difference between the fragment storage diagrams (where every thread holds data) and the metadata storage diagrams (where not every thread does) is this additional notation, e.g. Ti or T2i. We further note that for the m16n8k16 shape, the metadata is held by one out of every 4 threads, and the corresponding metadata storage diagram uses the Ti notation; for the m16n8k32 shape, the metadata is held by 2 out of every 4 threads, and the corresponding diagram uses the T2i notation.
For the m16n8k16 case, figure 86 shows a storage pattern, with the notation that “sparsity selector 0 indicates thread T0 (out of 4 shown) holds the metadata”.
Referring to the instruction description, we see that the sparsity selector is operand f; it is a 32-bit integer constant, constrained to values in the range 0…3.
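To tie the operands together, here is a minimal inline-PTX sketch of the f16 variant of the m16n8k16 sparse mma, modeled on what CUTLASS emits; the wrapper function and fragment variable names are illustrative, fragment loading/storing is omitted, and the trailing immediate 0x0 is the sparsity selector f:

```cuda
#include <cstdint>

// Sketch: m16n8k16 sparse MMA, f16 inputs/outputs, executed warp-wide.
// a0,a1: compressed (2:4) A fragment; b0,b1: B fragment;
// c0,c1 / d0,d1: accumulator fragments; e: 32-bit metadata register.
__device__ void mma_sp_m16n8k16_f16(uint32_t &d0, uint32_t &d1,
                                    uint32_t a0, uint32_t a1,
                                    uint32_t b0, uint32_t b1,
                                    uint32_t c0, uint32_t c1,
                                    uint32_t e) {
    asm volatile(
        "mma.sp.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
        "{%0,%1}, {%2,%3}, {%4,%5}, {%6,%7}, %8, 0x0;\n"
        : "=r"(d0), "=r"(d1)
        : "r"(a0), "r"(a1), "r"(b0), "r"(b1),
          "r"(c0), "r"(c1), "r"(e));
}
```

Note that because the selector is an integer constant, it is baked into the instruction string here (0x0) rather than passed as a register operand.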
@Robert_Crovella Thank you very much. Could you please tell me whether there is an API for Sparse Tensor Cores, like the WMMA API for Tensor Cores? And if not, could you give us a standard example of using Sparse Tensor Cores via PTX? I find it very difficult to find an example on Google.
At the moment, there is no sparse functionality exposed here. I’m not sure if that is what you are referring to with “WMMA API”.
I’m not sure when I will have time to assemble one. In the meantime, CUTLASS implements sparse GEMM. You could either use it directly, or study it as an implementation example.