“CUDA cores” are a useful marketing measure, but tell you very little about how an SMX executes instructions.
Take a look at Figure 2 in this whitepaper:
At the top are the 4 warp schedulers, which pick from the set of available warps (groups of 32 threads) on the SMX. They do not pick individual threads for execution. Once each scheduler has selected a warp, its two dispatch units pick up to two independent instructions from that warp to issue to the appropriate pipelines. Each column of 16 “cores” is a warp-wide pipeline, which can finish one warp instruction every 2 clocks, because the 32 threads of a warp share 16 lanes. That is a throughput figure, not a latency: any single instruction takes something like 10-20 clocks to complete from start to finish. As a result, each of the 12 pipelines only needs a new instruction every 2 clocks to stay full, which works out to 12 / 2 = 6 arithmetic instructions per clock across the SMX. With 4 schedulers x 2 dispatchers = 8 dispatchers, the SMX needs to issue arithmetic instructions from 6 of the 8 dispatchers every clock to keep all the CUDA cores busy. Load/store and special function units are separate pipelines.
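If it helps to see the arithmetic spelled out, here is a rough sketch in Python. The numbers are the ones quoted above (4 schedulers, 2 dispatchers each, 12 pipelines of 16 lanes, 32-thread warps). The function name `smx_issue_budget` and the dictionary keys are made up for illustration and are not part of any NVIDIA API.

```python
def smx_issue_budget(warp_size=32, schedulers=4, dispatchers_per_scheduler=2,
                     pipelines=12, lanes_per_pipeline=16):
    """Back-of-the-envelope issue-rate arithmetic for one SMX (hypothetical helper)."""
    # A 32-thread warp on a 16-lane pipeline occupies it for 2 clocks.
    clocks_per_warp_instruction = warp_size // lanes_per_pipeline
    # Each pipeline therefore needs one new instruction every 2 clocks,
    # so all pipelines together need this many instructions per clock.
    arithmetic_issues_per_clock = pipelines / clocks_per_warp_instruction
    # Total dispatch slots available per clock.
    dispatchers = schedulers * dispatchers_per_scheduler
    return {
        "cuda_cores": pipelines * lanes_per_pipeline,
        "clocks_per_warp_instruction": clocks_per_warp_instruction,
        "arithmetic_issues_per_clock": arithmetic_issues_per_clock,
        "dispatchers": dispatchers,
        "dispatch_fraction_needed": arithmetic_issues_per_clock / dispatchers,
    }


budget = smx_issue_budget()
for name, value in budget.items():
    print(f"{name}: {value}")
```

The point of the last entry is that three quarters of all dispatch slots have to carry arithmetic instructions every clock, which is why a single instruction per warp per clock cannot saturate the cores.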