Why 32 SP/SM in Fermi instead of 16 SP/SM?

The Fermi white paper states that there are 32 SPs in one SM in Fermi. To my understanding, that means each SM is a 32-way SIMD unit.

But each SM is also equipped with two issue units, which is what is called dual-issue. The two issue units can decode two different instructions and issue each of them to one half of the SM. This looks like two independent 16-way SMs. So why bother putting them together to form a 32-way SIMD unit?

Also, there are 16 load/store units per SM. So when two load/store instructions are dual-issued, does it mean that the two instructions will have to be executed in two cycles?

I would guess that it allowed them to double the SP count without having to double the shared memory and register file.

Good point.

The real reason for the two warps per SM is that the ALUs can be linked together for double precision. Since instruction issue happens only once every two clock cycles, 16 SPs are enough to process a 32-wide warp in those two clocks.
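
To make the arithmetic in that last reply concrete, here is a rough sketch in Python. It is only a back-of-the-envelope model, not a description of the actual hardware pipeline: it assumes a warp of 32 threads and that each lane (an SP or a load/store unit) handles one thread per clock. The function names `cycles_per_warp` and `threads_per_clock` are made up for this illustration.

```python
import math

WARP_SIZE = 32  # threads per warp (assumed fixed for this sketch)


def cycles_per_warp(lanes, warp_size=WARP_SIZE):
    """Clocks needed to push one warp through a unit with `lanes` lanes.

    Assumption: every lane processes exactly one thread per clock.
    """
    return math.ceil(warp_size / lanes)


def threads_per_clock(groups, lanes_per_group, warp_size=WARP_SIZE):
    """Aggregate thread-instructions per clock when `groups` independent
    lane groups each work on their own warp (hypothetical model)."""
    return groups * warp_size // cycles_per_warp(lanes_per_group, warp_size)


if __name__ == "__main__":
    # One group of 16 SPs working on a 32-thread warp.
    print("clocks per warp on 16 SPs:", cycles_per_warp(16))
    # The 16 load/store units from the question, same model.
    print("clocks per warp on 16 LD/ST units:", cycles_per_warp(16))
    # Two groups of 16 SPs, each fed its own warp.
    print("threads per clock, 2 x 16 SPs:", threads_per_clock(2, 16))
```

Under these assumptions a 16-lane group needs two clocks per warp, which lines up with an issue rate of one instruction every two clocks, and two such groups together still retire 32 thread-instructions per clock. The same model suggests a single warp-wide load or store occupies the 16 load/store units for two clocks; whether two memory instructions can actually be dual-issued is not something this sketch answers.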