How do four subcores in a single SM share two FP64 cores?

adeagle · September 10, 2024, 2:51am

I ran a kernel with a block size of 128 (32 threads per warp × 4 subcores) and used ncu to profile the smsp__pipe_fp64_cycles_active.max metric.
On SM7.5 it reports 64 cycles; on SM8.6 it reports 16 cycles. Why?

Greg · September 10, 2024, 4:47pm

In SM7.5 (TU1xx) and SM8.6 (GA10x) the FP64 execution unit is shared between SM sub-partitions. The SM sub-partition warp scheduler issues the warp instruction to an instruction queue in the MIO unit. The MIO controller is responsible for arbitrating between the instruction queues and dispatching warp instructions to the shared unit. This method is used for all shared units in the SM, including but not limited to the load/store unit (local, global, shared), texture unit, branch unit, etc.
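To make the queue-and-arbiter idea concrete, here is a toy Python model, not a description of the real hardware. It assumes one instruction queue per sub-partition, a simple round-robin arbitration policy, and a shared unit that is busy for a fixed number of cycles per warp instruction. The round-robin policy, the function name `arbitrate`, the 16-cycle occupancy, and the `"DADD"` placeholder strings are all assumptions made for illustration; the actual MIO arbitration policy is not stated above.

```python
from collections import deque

def arbitrate(queues, cycles_per_instruction):
    """Toy round-robin arbiter (assumed policy, not the real MIO logic).

    queues: one list of pending warp instructions per sub-partition.
    Returns (schedule, total_cycles), where schedule holds
    (start_cycle, sub_partition_index, instruction) tuples.
    """
    pending = [deque(q) for q in queues]
    schedule = []
    cycle = 0
    next_idx = 0  # queue that gets first chance on the next pick
    while any(pending):  # an empty deque is falsy
        # Scan queues in round-robin order, starting at next_idx.
        for offset in range(len(pending)):
            i = (next_idx + offset) % len(pending)
            if pending[i]:
                instr = pending[i].popleft()
                schedule.append((cycle, i, instr))
                # The shared unit is occupied until this instruction drains.
                cycle += cycles_per_instruction
                next_idx = (i + 1) % len(pending)
                break
    return schedule, cycle

# Four sub-partitions, each with one FP64 warp instruction queued.
queues = [["DADD"], ["DADD"], ["DADD"], ["DADD"]]
schedule, total = arbitrate(queues, cycles_per_instruction=16)
for start, sub_partition, instr in schedule:
    print(f"cycle {start:2d}: sub-partition {sub_partition} dispatches {instr}")
print("shared unit busy for", total, "cycles")
```

In this model the four warp instructions are serialized through the single shared unit, so each sub-partition waits its turn rather than executing in parallel.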

The term “core” is used to define the number of lanes (threads) that can be issued per cycle. Do not equate a “CUDA Core” or “Tensor Core” or “FP64 Core” to a CPU core.
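As a rough illustration of that definition, the sketch below treats a unit's "core" count as lanes per cycle and works out how many cycles the unit is occupied by a given number of warp instructions. The function name `issue_cycles` is hypothetical, the 2-lane figure is taken from the thread title, and the assumption that warp instructions pass through the unit back to back is a simplification, not a statement about how the ncu metric counts cycles.

```python
import math

WARP_SIZE = 32  # threads per warp

def issue_cycles(num_warps, lanes_per_cycle):
    """Cycles a unit is occupied, assuming each warp instruction is
    processed lanes_per_cycle threads at a time, back to back."""
    cycles_per_warp = math.ceil(WARP_SIZE / lanes_per_cycle)
    return num_warps * cycles_per_warp

one_warp = issue_cycles(num_warps=1, lanes_per_cycle=2)
four_warps = issue_cycles(num_warps=4, lanes_per_cycle=2)
print(one_warp, four_warps)  # prints "16 64"
```

Under these assumptions, a single warp instruction occupies a 2-lane unit for 16 cycles and four warp instructions occupy it for 64 cycles. The sketch does not establish how the profiler attributes those cycles to each sub-partition on a given architecture.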