In the Fermi architecture, an SM has 32 CUDA cores and two warp schedulers.
Suppose there are 12 warps, and warp 8 is assigned to the first scheduler and warp 9 to the second.
Each warp has 32 threads, and I assume each scheduler has 16 of the 32 cores available to it, so each warp needs 2 clock cycles to execute one instruction. Executing one instruction from each of warps 8 and 9 would therefore take 2 clock cycles, since the two schedulers work in parallel.
Is my assumption correct, or do the cores run at double the frequency of the warp schedulers, so that each warp instruction completes in only one clock cycle with respect to the warp scheduler?
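
To make the arithmetic behind both possibilities concrete, here is a minimal sketch. The function name `cycles_per_warp_instruction` and the `core_clock_ratio` parameter are hypothetical names I made up for illustration. It assumes the cores are split evenly between the schedulers and that each core handles one thread per core clock, which is a simplification and may not match the real hardware.

```python
WARP_SIZE = 32  # threads per warp


def cycles_per_warp_instruction(cores_per_sm, schedulers_per_sm, core_clock_ratio=1):
    """Scheduler-clock cycles to execute one instruction for one full warp.

    Assumptions (not confirmed hardware behaviour):
      - cores are divided evenly between the warp schedulers
      - each core processes one thread per core clock
      - core_clock_ratio = core clock / scheduler clock
    """
    cores_per_scheduler = cores_per_sm // schedulers_per_sm
    core_cycles = WARP_SIZE / cores_per_scheduler
    return core_cycles / core_clock_ratio


# Case 1: cores and schedulers run at the same clock (my first assumption).
same_clock = cycles_per_warp_instruction(32, 2)

# Case 2: cores run at twice the scheduler clock (the alternative I am asking about).
double_clock = cycles_per_warp_instruction(32, 2, core_clock_ratio=2)

print(same_clock, double_clock)
```

Under these assumptions the first case gives 2 cycles per warp instruction and the second gives 1, which is exactly the difference I am trying to understand.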