This comes down to SM design, which is a trade-off: the designers try to maximize performance for a given resource utilization and die area, on the codes that they have access to.
The cc 6.0 SM can in some ways be viewed as “half” of a cc 6.1 or cc 6.2 SM. It has half the cores, for example (64 vs. 128 FP32 cores), and half the warp schedulers (2 vs. 4); study the corresponding whitepapers. In this case it also appears to have “half the throughput” of the cc 6.1 and cc 6.2 SMs.
The cc 7.0 SM has a lot in common with the cc 6.0 SM.
Likewise, the cc 5.x and cc 6.1/6.2 SM designs are similar to each other in various ways.
Kepler (3.x) had a “huge” SM design, capable of issuing up to 8 instructions in a cycle (4 warp schedulers, each able to dual-issue), as did 5.x and 6.1/6.2. Effectively, some later GPUs “pared down” the size of the SM. (However, later GPUs also had/have many more SMs per die.)
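To make the “half an SM” comparison a bit more concrete, here is a minimal sketch that tabulates per-SM FP32 core counts and warp scheduler counts by compute capability. The table name `SM_RESOURCES` and the helper `ratio` are made up for this illustration, and the figures are the per-SM numbers as I recall them from the whitepapers and the programming guide, so check them against those documents before relying on them.

```python
# Per-SM resources keyed by compute capability (major, minor).
# Assumption: figures are recalled from NVIDIA's architecture whitepapers and
# the CUDA programming guide; verify there before depending on them.
SM_RESOURCES = {
    (3, 0): {"fp32_cores": 192, "warp_schedulers": 4},  # Kepler
    (5, 0): {"fp32_cores": 128, "warp_schedulers": 4},  # Maxwell
    (6, 0): {"fp32_cores": 64,  "warp_schedulers": 2},  # Pascal (GP100)
    (6, 1): {"fp32_cores": 128, "warp_schedulers": 4},  # Pascal (GP10x)
    (6, 2): {"fp32_cores": 128, "warp_schedulers": 4},  # Pascal (Tegra)
    (7, 0): {"fp32_cores": 64,  "warp_schedulers": 4},  # Volta
}


def ratio(cc_a, cc_b, key):
    """Return resource `key` of an SM of cc_a relative to an SM of cc_b."""
    return SM_RESOURCES[cc_a][key] / SM_RESOURCES[cc_b][key]


if __name__ == "__main__":
    # Print the table, then the cc 6.0 vs. cc 6.1 comparison from the text.
    for cc in sorted(SM_RESOURCES):
        res = SM_RESOURCES[cc]
        print(f"cc {cc[0]}.{cc[1]}: {res['fp32_cores']} FP32 cores/SM, "
              f"{res['warp_schedulers']} warp schedulers/SM")
    for key in ("fp32_cores", "warp_schedulers"):
        print(f"cc 6.0 / cc 6.1 {key}: {ratio((6, 0), (6, 1), key)}")
```

These are per-SM numbers only. Whole-GPU throughput also depends on the number of SMs on the die and on clocks, which this sketch deliberately leaves out.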
Beyond that, I think you have to look at the various changes as simply a balancing act of SM die area vs. performance gained, as measured on a suite of codes that the GPU designers consider to be “relevant” or “current” as they are making their SM design choices.