I don’t believe anyone can give you a solid, quantitative answer as to why 32 was selected, but there are certainly good conjectures (that you allude to) for why NVIDIA would make the warp size 32 and not 8:
With a warp larger than the number of CUDA cores, the instruction decoder logic can run at the lower core clock rate rather than the higher shader clock rate. The recent Anandtech review of the GTX 580 mentions that NVIDIA used two types of transistors on their chips before GF110: a slower, low-leakage-current design (good for power consumption), and a faster, high-leakage-current design (good for speed). A wide warp would allow them to use the slow transistor design for instruction decoding and save power.
Additionally, as a highly pipelined architecture, a CUDA multiprocessor trades latency for throughput. An instruction can take many clocks (I’ve seen estimates that this is > 20 for simple instructions) from start to finish, but pipelining ensures that in the compute capability 1.x chips, one simple instruction for an entire warp finishes every 4 clocks. The main implementation tradeoff with pipelining is inter-instruction dependency. Pipelining is a form of instruction-level parallelism, requiring instructions that do not depend on their immediate predecessors in order to be fully effective. On a single-threaded architecture, this can be very complicated to ensure, especially in the face of branches. Branch prediction and instruction reordering are such hot items in CPU design because they help ensure that you don’t have to flush the pipeline and lose all that throughput benefit. These features take up valuable chip area that could be spent on more arithmetic units. The longer the pipeline, the harder you have to work to keep it full.
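To make the dependency point concrete, here is a hypothetical pair of kernels (the names and constants are mine, purely illustrative): in the first, every operation reads the result of the previous one, so each instruction must wait out the full pipeline latency; in the second, the four accumulators never touch each other, so their instructions can be in flight at the same time.

```
// Serial chain: each multiply-add depends on the one before it,
// so the pipeline stalls waiting for each result.
__global__ void dependent_chain(float *out, float x)
{
    float acc = x;
    acc = acc * 1.01f + 1.0f;
    acc = acc * 1.01f + 1.0f;
    acc = acc * 1.01f + 1.0f;
    acc = acc * 1.01f + 1.0f;
    out[threadIdx.x] = acc;
}

// Independent chains: the four accumulators have no dependencies on
// each other, so their instructions can overlap in the pipeline.
__global__ void independent_chains(float *out, float x)
{
    float a = x, b = x, c = x, d = x;
    a = a * 1.01f + 1.0f;
    b = b * 1.02f + 1.0f;
    c = c * 1.03f + 1.0f;
    d = d * 1.04f + 1.0f;
    out[threadIdx.x] = a + b + c + d;
}
```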
Because CUDA is basically SIMD programming wearing a different hat, you have a second source of parallelism to exploit in your pipeline. By construction, threads in a warp can’t depend on each other except for a handful of synchronization instructions. By putting 4 threads into the pipeline for each CUDA core, you buy 4 clocks of guaranteed independence, making your pipeline (from a management perspective) 1/4 as long.
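As a sketch of what that independence looks like in code (again, my own toy example): each thread in a warp runs the same instruction stream on its own element and never reads another thread’s result, and the only cross-thread coordination is through explicit primitives like `__syncthreads()`. The host side also shows that the warp size is a queryable device property rather than a hard-coded constant.

```
#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel: every thread operates only on its own element, so the
// hardware gets guaranteed-independent work to keep the pipeline full.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;   // no thread reads another thread's result
}

int main()
{
    // Portable code queries the warp size instead of assuming 32.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    std::printf("warpSize = %d\n", prop.warpSize);
    return 0;
}
```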
Clearly, NVIDIA decided to get more aggressive about instruction-level parallelism in Fermi, because now there are 32 CUDA cores per MP, and two warps finish every two clocks. It’s interesting to note that they chose to group the CUDA cores into two sets of 16, each running a separate warp instruction. That means they still get some benefit from guaranteed thread independence in the pipeline, which might explain in part why they chose not to go with one warp every clock on all 32 CUDA cores at once. (Also note that this organization means they have two instruction decoders that can each run slower, rather than one very fast instruction decoder.)
(Interleaving 4 instructions from the same warp would be tremendously complicated and would bring little benefit, so I doubt that is the scheduling strategy.)