The warp size, I think, is dictated by the hardware. Each multiprocessor has only one instruction decoder, so all 8 stream processors (on current cards) have to run the same instruction. That means the minimum possible warp size is 8. The stream processors are also pipelined, so for maximum efficiency you need at least two instructions in flight to keep the pipeline stages busy. The easiest way to do that is to run the same instruction you already decoded, but for another set of threads, which doubles the warp size to 16. There is also a clock-rate difference between the instruction decoder and the stream processors, so the decoder may need extra time to produce the next instruction, which makes doubling the warp size again to 32 seem plausible. (I am fuzzy on that last step. Someone who really knows the Nvidia hardware would have to comment on whether I got the details right.)
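Whatever the exact hardware reasoning, the warp size itself is something the runtime will tell you rather than something you pick. Here is a minimal host-side sketch (mine, not from any particular sample; it assumes device 0) that just queries it:

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0

    printf("Device:           %s\n", prop.name);
    printf("Warp size:        %d threads\n", prop.warpSize);        // 32 on current cards
    printf("Multiprocessors:  %d\n", prop.multiProcessorCount);
    return 0;
}
```

Inside a kernel you can also read the built-in `warpSize` variable instead of hard-coding 32.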
So I think the warp size on current hardware could not be reduced without leaving parts of the chip underused. This is probably a general result, in fact: the best warp size for any chip is the smallest one that can keep the chip busy 100% of the time. A bigger one brings no benefit (and would leave the instruction decoder partly idle), and a smaller one leaves the stream processors underutilized while they wait for new instructions.
For future products, I guess it is a design tradeoff: if you spend transistors on more instruction decoders, the warp size can be reduced, making divergent kernels faster. But if you spend those transistors on more stream processors per multiprocessor, you will finish highly parallel jobs faster. Given that Nvidia's main business is selling cards for graphics, I bet the deciding factor will be whichever balance works best for 3D rendering. :)
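To make the "divergent kernels" point concrete, here is a toy kernel (my own example, not from any NVIDIA docs) where every warp diverges. Threads in the same warp that take different branches are serialized: the warp executes both paths one after the other, masking off the inactive threads, so the cost of divergence grows with the warp size.

```
// Every warp contains both even and odd thread indices, so with a 32-thread
// warp every warp pays for both branches. With a hypothetically smaller warp,
// fewer threads would be dragged through the path they don't need.
__global__ void divergent(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (i % 2 == 0)
        out[i] = in[i] * 2.0f;   // "even" work
    else
        out[i] = in[i] + 1.0f;   // "odd" work
}
```

The usual workaround today is to branch on something warp-aligned (e.g. `(i / warpSize) % 2`) so each warp stays on a single path, which matters regardless of what the warp size ends up being.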