Execution of threads from the same warp

Hi all,
The document "OpenCL Programming Guide for the CUDA Architecture" says:
"Individual threads composing a warp start together at the same program address, but they have their own instruction address counter and register state and are therefore free to branch and execute independently."
If each thread has its own instruction address counter, and those counters differ between threads, how does the SM decode a single instruction for the whole warp?
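
To make the question concrete, here is a minimal CUDA sketch of the situation I mean. It is my own hypothetical example, not code from the guide: the kernel name `divergent` and the values it writes are made up for illustration. A single warp of 32 threads is launched, and the even and odd threads take different sides of a branch, so their instruction addresses would differ inside the `if`/`else`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: even and odd threads of the same warp take
// different branches, so they do not follow the same instruction path.
__global__ void divergent(int *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid % 2 == 0) {
        out[tid] = tid * 2;    // path taken by even threads
    } else {
        out[tid] = tid + 100;  // path taken by odd threads
    }
}

int main()
{
    const int n = 32;  // exactly one warp
    int host[n];
    int *dev = nullptr;

    if (cudaMalloc(&dev, n * sizeof(int)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    divergent<<<1, n>>>(dev);  // one block containing one warp

    // cudaMemcpy waits for the kernel to finish before copying.
    if (cudaMemcpy(host, dev, n * sizeof(int),
                   cudaMemcpyDeviceToHost) != cudaSuccess) {
        fprintf(stderr, "kernel launch or cudaMemcpy failed\n");
        cudaFree(dev);
        return 1;
    }
    cudaFree(dev);

    for (int i = 0; i < n; ++i) {
        printf("thread %d wrote %d\n", i, host[i]);
    }
    return 0;
}
```

The results are correct for every thread, so the hardware clearly handles this case; what I do not understand is how one decoded instruction can serve all 32 threads while they are on different paths.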

An SM in the Fermi architecture has 16 cores, but each warp is composed of 32 threads.
Does the execution stage of the instruction pipeline process the first
16 threads and then the other 16? That seems a little weird to me.
Why doesn't the SM have 32 cores?