I have two questions regarding CUDA:
1. How can simple loop unrolling sometimes raise the peak performance of a kernel by such large factors? Is it just the reduced instruction overhead, or is the G80 doing some kind of out-of-order processing?
2. What kind of dynamic branch prediction, if any, is employed in the G80 processors? I searched the manual and this forum, and it seems the only branch prediction is done by the compiler. Is there any more information in the public domain?
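To make the first question concrete, here is a minimal sketch of the transformation I mean. It is written as plain C++ rather than as a kernel so it can be compiled anywhere. The function names `dot_rolled` and `dot_unrolled4` are my own, and I am only assuming that compiler-driven unrolling (for example via `#pragma unroll` in CUDA) produces something roughly like the second version.

```cpp
#include <cstddef>

// Rolled loop: each iteration does one multiply-add, plus the loop
// overhead of an index increment, a compare, and a branch.
float dot_rolled(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

// Unrolled by a factor of 4: the same loop overhead is now paid once
// per four multiply-adds, and the four accumulators do not depend on
// each other within an iteration.
float dot_unrolled4(const float* a, const float* b, std::size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    float sum = (s0 + s1) + (s2 + s3);
    // Remainder loop for the last n % 4 elements.
    for (; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Both functions compute the same dot product (up to floating-point summation order). What I am asking is whether the speedup from the second form comes only from the reduced loop overhead, or whether the hardware also benefits from the independent operations in some other way.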