I would appreciate it if someone could shed some light on the details of warp execution.
Quote from the programming guide:
However, the warp size on the GeForce 8800 GTX is 32, and the number of ALUs in a multiprocessor is 8, AFAIK (please correct me if I am wrong). Therefore all the threads of a warp cannot be executed simultaneously. Are they executed in quarter-warp portions, 8 threads at a time? If so, does that mean an instruction from another warp cannot be executed until all four quarters of the currently executing warp have finished their current instruction?
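
To make the arithmetic behind my question explicit, here is a minimal sketch. The warp size of 32 is from the programming guide; the figure of 8 ALUs per multiprocessor is only my assumption, and the function name `passes_per_warp_instruction` is made up for illustration. It models nothing about the real scheduler, only the division I have in mind.

```python
def passes_per_warp_instruction(warp_size: int, alus_per_mp: int) -> int:
    """Passes needed to run one instruction for every thread of a warp,
    if each pass handles at most `alus_per_mp` threads (ceiling division)."""
    return (warp_size + alus_per_mp - 1) // alus_per_mp


WARP_SIZE = 32    # warp size stated for GeForce 8800 GTX
ALUS_PER_MP = 8   # ASSUMPTION: ALUs per multiprocessor, not confirmed

passes = passes_per_warp_instruction(WARP_SIZE, ALUS_PER_MP)
threads_per_pass = WARP_SIZE // passes

print(passes)            # 4
print(threads_per_pass)  # 8
```

If the assumption holds, one warp instruction would take 4 passes of 8 threads each, which is where the "1/4 warp" guess comes from.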