Ti refers to the thread. A tensor core op is a warp-wide operation; each thread in the warp holds input (and output) for the op. The metadata is likewise contained in one register per thread in the warp. The table you have excerpted shows, for each thread, which area its metadata applies to.
In figure 83, we see that a sparse matrix suitable for this kind of sparse matrix-matrix multiply has a particular sparsity pattern. You cannot have an arbitrary sparsity pattern. Instead, considering each square set or “chunk” of 4 elements, exactly 2 of those 4 elements are allowed to be significant, and the other two must be zero.
This metadata selects which quadrants of the square chunk have non-zero data. An example of the relationship between chunk arrangement and metadata is given in figure 84.
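To make the chunk/metadata relationship concrete, here is a small host-side sketch. The helper function is hypothetical (not an NVIDIA API), and the low-to-high packing order of the 2-bit indices is an assumption for illustration; the authoritative layout is the one given in the PTX doc and figure 84:

```cuda
#include <cstdint>
#include <cstdio>

// Hypothetical helper: compress one 4-element chunk under 2:4 sparsity.
// Exactly two of the four inputs are expected to be non-zero. The two
// kept values go to out[0..1], and their positions are packed as two
// 2-bit indices into the returned metadata nibble (assumed low-to-high).
uint8_t compress_chunk_2of4(const float in[4], float out[2]) {
    uint8_t meta = 0;
    int kept = 0;
    for (int pos = 0; pos < 4 && kept < 2; ++pos) {
        if (in[pos] != 0.0f) {
            out[kept] = in[pos];
            meta |= (uint8_t)(pos << (2 * kept)); // 2-bit position index per kept value
            ++kept;
        }
    }
    return meta;
}

int main() {
    float chunk[4] = {0.0f, 1.5f, 0.0f, -2.0f}; // non-zeros at positions 1 and 3
    float vals[2];
    uint8_t meta = compress_chunk_2of4(chunk, vals);
    printf("kept %.1f %.1f, metadata 0x%x\n", vals[0], vals[1], meta); // 0xd = 0b11'01
    return 0;
}
```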
I see… But why is Ti used here in particular? Elsewhere the diagrams use T0, T1…
Also, I see a T2i. What does this mean?
I have the same question. @Robert_Crovella, could you please explain it in detail? What is the meaning of “i” here?
Referring to section 9.7.15.5.1, “Sparse Matrix Storage”, in the current (PTX 8.5 / CUDA 12.6) doc, we see the following:
In a group of four consecutive threads, one or more threads store the metadata for the whole group depending upon the matrix shape. These threads are specified using an additional sparsity selector operand.
And we note that the difference between the fragment storage diagrams (where every thread holds data) and the metadata storage diagrams (where not every thread does) is this additional notation, e.g. Ti or T2i. We further note that for the m16n8k16 shape, the metadata is held by one out of every 4 threads, and the corresponding metadata storage diagram uses the Ti notation; for the m16n8k32 shape, the metadata is held by 2 out of every 4 threads, and the corresponding diagram uses the T2i notation.
For the m16n8k16 case, figure 86 shows a storage pattern, with the notation that “sparsity selector 0 indicates thread T0 (out of 4 shown) holds the metadata”.
Referring to the instruction description, we see that the sparsity selector is operand f; it is a 32-bit integer constant, constrained to values in the range 0…3.
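To tie the operands together, here is a minimal inline-PTX sketch of the f16 variant of the m16n8k16 sparse mma, modeled on what CUTLASS emits; the wrapper function and fragment variable names are illustrative, fragment loading/storing is omitted, and the trailing immediate 0x0 is the sparsity selector f:

```cuda
#include <cstdint>

// Sketch: m16n8k16 sparse MMA, f16 inputs/outputs, executed warp-wide.
// a0,a1: compressed (2:4) A fragment; b0,b1: B fragment;
// c0,c1 / d0,d1: accumulator fragments; e: 32-bit metadata register.
__device__ void mma_sp_m16n8k16_f16(uint32_t &d0, uint32_t &d1,
                                    uint32_t a0, uint32_t a1,
                                    uint32_t b0, uint32_t b1,
                                    uint32_t c0, uint32_t c1,
                                    uint32_t e) {
    asm volatile(
        "mma.sp.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
        "{%0,%1}, {%2,%3}, {%4,%5}, {%6,%7}, %8, 0x0;\n"
        : "=r"(d0), "=r"(d1)
        : "r"(a0), "r"(a1), "r"(b0), "r"(b1),
          "r"(c0), "r"(c1), "r"(e));
}
```

Note that because the selector is an integer constant, it is baked into the instruction string here (0x0) rather than passed as a register operand.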
@Robert_Crovella Thank you very much. Could you please tell me whether there is an API for Sparse Tensor Cores, like the WMMA API for Tensor Cores? And if not, could you give us a standard example of using Sparse Tensor Cores via PTX? I find it very difficult to find an example on Google.
At the moment, there is no sparse functionality exposed here. I’m not sure if that is what you are referring to with “WMMA API”.
I’m not sure when I will have time to assemble one. In the meantime, CUTLASS implements sparse GEMM. You could either use it directly, or study it as an implementation example.