Is wgmma executed on the same EU or 4 EUs?

A warpgroup consists of 4 contiguous warps that work collectively.
So are all 4 warps of a warpgroup dispatched to the same EU and executed on that EU's tensor core?
I ask because traditionally each warp is dispatched to one EU, and the next warp is dispatched to the next EU.

Also, does only the "head warp" (e.g., warp rank 0) execute the wgmma instruction on behalf of the warpgroup? Or does a wgmma instruction eventually break down into 4 low-level wmma instructions that work in lock-step, so that, e.g., the B matrix is loaded once from shared memory but used 4 times, once for each low-level wmma instruction?

I have not looked into the wgmma functionality in detail, but I had assumed that each of the 4 warps is assigned to a different SM partition, so that the warpgroup's wgmma instructions collectively use all 4 tensor core execution units.
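
To make that assumption concrete, here is a minimal host-side C++ sketch of the index arithmetic. The warpgroup part (4 contiguous warps, first warp rank a multiple of 4) follows the PTX definition. The partition part is only an assumption: it supposes warps are spread round-robin over the 4 SM partitions (warp index modulo 4), which as far as I know is not a documented guarantee. All function and constant names are made up for illustration.

```cpp
#include <cstdio>

// Sketch only: helper names are hypothetical, not part of any CUDA API.
constexpr int kWarpSize = 32;          // threads per warp
constexpr int kWarpsPerWarpgroup = 4;  // a warpgroup is 4 contiguous warps
constexpr int kSmPartitions = 4;       // SM partitions, each with its own tensor core unit

// Warp index of a thread within its (1-D) thread block.
constexpr int warp_index(int thread_idx) { return thread_idx / kWarpSize; }

// Warpgroup that the thread's warp belongs to.
constexpr int warpgroup_index(int thread_idx) {
    return warp_index(thread_idx) / kWarpsPerWarpgroup;
}

// Rank of the thread's warp inside its warpgroup (0..3).
constexpr int warp_rank_in_group(int thread_idx) {
    return warp_index(thread_idx) % kWarpsPerWarpgroup;
}

// ASSUMPTION: round-robin placement of warps onto SM partitions.
// This models the guess in the post above; it is not a documented rule.
constexpr int assumed_partition(int thread_idx) {
    return warp_index(thread_idx) % kSmPartitions;
}

// Print the mapping for the first thread of every warp in a block.
inline void print_warp_mapping(int num_threads) {
    for (int t = 0; t < num_threads; t += kWarpSize) {
        std::printf("warp %2d: warpgroup %d, rank %d, assumed partition %d\n",
                    warp_index(t), warpgroup_index(t),
                    warp_rank_in_group(t), assumed_partition(t));
    }
}
```

Under this model the 4 warps of any warpgroup always land on 4 different partitions, which is what would let one wgmma keep all 4 tensor core units busy. Calling `print_warp_mapping(256)` lists two warpgroups, each covering partitions 0 to 3.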

wmma instructions are high-level and assembled into mma instructions.

I looked it up: the 4 warps do not seem to exchange data directly. The rows of matrix A are divided among the 4 warps, and matrix B must be in shared memory, so each warp loads matrix B from shared memory.
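
The row split can be sketched as follows, again as plain host-side C++. It assumes the 64-row tile shape of wgmma (m64nNk16) and an even split of 16 rows per warp by warp rank. The helper names are hypothetical, and the exact per-thread fragment layout is left out; please check the PTX documentation for that.

```cpp
#include <cstdio>

// Sketch only: names are hypothetical, the layout is simplified to whole rows.
constexpr int kTileM = 64;        // rows of the A (and D) tile per wgmma
constexpr int kWarpsInGroup = 4;  // warps per warpgroup
constexpr int kRowsPerWarp = kTileM / kWarpsInGroup;  // 16 rows per warp

struct RowRange {
    int first;  // first row handled by the warp (inclusive)
    int last;   // last row handled by the warp (inclusive)
};

// Rows of the A tile handled by the warp with the given rank (0..3).
constexpr RowRange rows_for_warp(int warp_rank) {
    return RowRange{warp_rank * kRowsPerWarp,
                    (warp_rank + 1) * kRowsPerWarp - 1};
}

// Inverse mapping: which warp rank handles a given row of the tile.
constexpr int owning_warp(int row) { return row / kRowsPerWarp; }

// Every warp works on its own rows of A but needs the whole B tile,
// which is why B sits in shared memory where all 4 warps can read it.
inline void print_row_split() {
    for (int w = 0; w < kWarpsInGroup; ++w) {
        const RowRange r = rows_for_warp(w);
        std::printf("warp rank %d: A rows %2d..%2d, B: full tile from shared memory\n",
                    w, r.first, r.last);
    }
}
```

Since no warp needs rows owned by another warp, the sketch is consistent with the observation that the warps do not exchange data directly.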

So the only performance advantage could be the asynchronous execution, and this relates more to loading from shared memory than to computing the mma instruction itself.