[Question] How do the threads in a warp work collectively?

aaron_zlt · July 7, 2024, 2:00pm

I’m very new to CUDA and PTX. My question after reading the PTX documentation (which is very hard for me to understand) is this: taking m16n8k8 as an example, since each thread holds 4 elements from A and 2 from B, how do the threads collectively ‘sum’ each other’s results to generate the final 16x8 matrix?

Robert_Crovella · July 7, 2024, 10:41pm

Presumably you are referring to the matrix-multiply instructions.

The PTX instructions that begin (for example) with mma are “feeding” a functional unit in the SM that does all the work to produce the result. The threads are not executing microcode or otherwise collaborating, except insofar as they feed data to the functional unit (in the form of a register patch holding the input data) and receive the result from the functional unit (in the form of a register patch holding the output data).

Conceptually, this is similar to other functional unit behavior. The SP (FP32) units, for example, when processing an FFMA instruction, receive input data from registers and put their output data into registers. The “threads” do not otherwise assist in the generation of the result, other than to feed the functional unit.

One difference with the matrix-multiply instructions is of course that an entire warp must participate. However, the combining of input data from different threads is not done by the threads themselves, but rather by the tensorcore unit. The tensorcore unit then distributes the results back to each thread’s register patch.
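
To make the gather/multiply/scatter idea concrete, here is a minimal host-side sketch in plain C++ (no GPU needed). It is only a model, not how the hardware is implemented. The coordinate maps are my assumption, taken from the PTX ISA description of the m16n8k8 fragments for f16 operands, so check them against the documentation. `LaneRegs`, `model_mma_m16n8k8` and the helper names are hypothetical, and `float` stands in for f16 to keep the code short.

```cpp
// Host-side model only. Layout maps are an assumption based on the PTX ISA
// fragment description for mma.m16n8k8 with f16 operands.
#include <array>
#include <cassert>  // for callers that want to assert on the results

constexpr int M = 16, N = 8, K = 8, WARP_SIZE = 32;

// Hypothetical stand-in for one lane's "register patch".
struct LaneRegs {
    float a[4] = {};  // this lane's 4 elements of A (16x8)
    float b[2] = {};  // this lane's 2 elements of B (8x8)
    float d[4] = {};  // this lane's 4 elements of D (16x8)
};

// A and D share the same lane-to-element map here because K == N == 8.
int acc_row(int lane, int i) { return (lane >> 2) + (i >= 2 ? 8 : 0); }
int acc_col(int lane, int i) { return (lane & 3) * 2 + (i & 1); }
int b_row(int lane, int i)   { return (lane & 3) * 2 + i; }
int b_col(int lane)          { return lane >> 2; }

// Models the functional unit with C = 0: gather inputs, multiply, scatter results.
void model_mma_m16n8k8(std::array<LaneRegs, WARP_SIZE>& warp) {
    float A[M][K] = {}, B[K][N] = {}, D[M][N] = {};
    // 1. Gather every lane's input registers into the full A and B matrices.
    for (int lane = 0; lane < WARP_SIZE; ++lane) {
        for (int i = 0; i < 4; ++i) A[acc_row(lane, i)][acc_col(lane, i)] = warp[lane].a[i];
        for (int i = 0; i < 2; ++i) B[b_row(lane, i)][b_col(lane)] = warp[lane].b[i];
    }
    // 2. The unit, not the threads, does the multiply-accumulate.
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n)
            for (int k = 0; k < K; ++k) D[m][n] += A[m][k] * B[k][n];
    // 3. Scatter the result back into every lane's output registers.
    for (int lane = 0; lane < WARP_SIZE; ++lane)
        for (int i = 0; i < 4; ++i) warp[lane].d[i] = D[acc_row(lane, i)][acc_col(lane, i)];
}

// Usage check: with B set to the identity, D equals A, so each lane
// should get back exactly the A values it put in.
bool identity_b_returns_a() {
    std::array<LaneRegs, WARP_SIZE> warp{};
    for (int lane = 0; lane < WARP_SIZE; ++lane) {
        for (int i = 0; i < 4; ++i)
            warp[lane].a[i] = float(acc_row(lane, i) * K + acc_col(lane, i));
        for (int i = 0; i < 2; ++i)
            warp[lane].b[i] = (b_row(lane, i) == b_col(lane)) ? 1.0f : 0.0f;
    }
    model_mma_m16n8k8(warp);
    for (int lane = 0; lane < WARP_SIZE; ++lane)
        for (int i = 0; i < 4; ++i)
            if (warp[lane].d[i] != warp[lane].a[i]) return false;
    return true;
}
```

Note that no lane ever reads another lane's registers in this model. The only place where data from different lanes meets is step 2, which stands in for the tensorcore unit.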

aaron_zlt · July 8, 2024, 4:09am

Thanks! It really is a very basic concept of tensorcore lol. That suggests that for now we only need to focus on the data layout, which is critical to the output.

After that I tried wmma (where I think the per-thread layout is similar to mma) and accessed the data in the fragment directly (I just wanted to see how the data is stored).

// m16n16k16 wmma
wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> frag_in;

However, I found that in frag_in (16 elements in total), the first eight elements are the same as the last eight elements.

Is this normal?

Robert_Crovella · July 8, 2024, 6:19am

AFAIK the data layout for the wmma instructions is unspecified. Opaque load and store instruction(s) are provided.

Curefab · July 8, 2024, 4:34pm

Just create some code like:

		// Each lane packs its own index into its B register (two f16 values).
		half2 b{ threadIdx.x, threadIdx.x + .5 };
		unsigned int B0 = reinterpret_cast<unsigned int&>(b); // actually UB, but usually accepted by nvcc
		// A0, A1, C0, C1, D0, D1 are unsigned int, each holding two packed f16 values.
		asm volatile(
			"mma.sync.aligned.m16n8k8.row.col.f16.f16.f16.f16 {%0, %1}, {%2, %3}, {%4}, {%5, %6};\n"
			: "=r"(D0), "=r"(D1)
			: "r"(A0), "r"(A1), "r"(B0), "r"(C0), "r"(C1));

Fill A0 and A1 in the same way as B0, set C0 and C1 to zero, and see which results appear where.
Then you can compare with the documentation.

system · July 22, 2024, 4:34pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
