The warp size is independent of the number of cores per SM. Pre-Fermi devices had only 8 cores per SM, and the warp size was still 32. Compute capability 2.0 devices have 32 cores per SM, but (except for double-precision instructions) those 32 cores do not all execute the same warp. Instead, each of the two instruction schedulers issues a different instruction per clock to one of two groups of 16 cores within the SM. Compute capability 2.1 added a third group of 16 cores, so that one of the instruction schedulers can co-issue two independent instructions from the same warp. As a result, compute capability 2.1 devices can complete up to 3 warp instructions every 2 clocks, given a favorable instruction stream.
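To make the arithmetic concrete, here is a rough sketch of the peak-throughput numbers implied by those core counts. It is a deliberately simplified model, not a description of the real pipeline: it assumes every core retires one thread-instruction per clock and ignores clock domains, latency, and scheduling constraints. The helper names `peakWarpInstructions` and `clocksPerWarpInstruction` are made up for this illustration.

```cpp
#include <cassert>

// The warp size is fixed at 32 regardless of how many cores an SM has.
constexpr int kWarpSize = 32;

// Hypothetical helper: upper bound on warp instructions an SM can complete
// in `clocks` clocks, ASSUMING each core retires one thread-instruction
// per clock (a simplification, not the actual hardware pipeline).
constexpr int peakWarpInstructions(int coresPerSM, int clocks) {
    return coresPerSM * clocks / kWarpSize;
}

// Hypothetical helper: clocks a group of `lanes` cores needs to push all
// 32 threads of one warp instruction through, under the same assumption.
constexpr int clocksPerWarpInstruction(int lanes) {
    return (kWarpSize + lanes - 1) / lanes;  // ceiling division
}

// Pre-Fermi: 8 cores, so one warp instruction occupies them for 4 clocks.
static_assert(clocksPerWarpInstruction(8) == 4, "8 lanes -> 4 clocks");

// Compute capability 2.0: 32 cores -> at most 2 warp instructions per 2 clocks.
static_assert(peakWarpInstructions(32, 2) == 2, "cc 2.0 peak");

// Compute capability 2.1: 48 cores -> at most 3 warp instructions per 2 clocks.
static_assert(peakWarpInstructions(48, 2) == 3, "cc 2.1 peak");
```

The point of the sketch is that the core count only changes how many clocks a warp instruction takes or how many can be in flight at once. The warp itself stays at 32 threads in every case.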