Thread divergence reduces performance because, when it occurs, not all threads in the warp do useful work: some are masked off, i.e. currently inactive. The corresponding execution resources sit idle and cannot be used by threads from other warps. Even when some or most threads in a warp are masked off, the threads in the warp still execute in lockstep, because there is only one program counter for the entire warp, not one program counter per individual thread as in a classical CPU.
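To illustrate, here is a minimal sketch of a divergent kernel (the kernel name and data layout are made up for illustration). The warp executes both branches one after the other, with the non-participating lanes masked off in each:

```cuda
__global__ void divergent(int *out)
{
    int lane = threadIdx.x % 32;   // lane index within the warp
    // The branch splits the warp: while one path executes, the lanes
    // on the other path are masked off and their execution slots are
    // wasted, roughly halving throughput for this section of code.
    if (lane < 16) {
        out[threadIdx.x] = 2 * lane;     // lanes 16..31 inactive here
    } else {
        out[threadIdx.x] = lane + 100;   // lanes 0..15 inactive here
    }
    // After the branches reconverge, all 32 lanes are active again.
}
```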
The closest thing to a CUDA “thread” on a CPU is a SIMD lane, which is also maskable in recent x86 architecture versions. The difference between CUDA’s SIMT and classical explicit SIMD is that the SIMD nature of the hardware, including the masking of the SIMD lanes, is mostly abstracted away, making it implicit SIMD. This is a lot nicer for programmers to deal with because it provides a single-thread view of program execution most of the time.
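The single-thread view can be seen in an ordinary CUDA kernel such as the standard SAXPY example below: the source contains no vector types, intrinsics, or mask registers, yet the hardware runs it 32 lanes at a time.

```cuda
// Implicit SIMD: the code is written from the viewpoint of a single
// thread. The hardware groups 32 such threads into a warp; the bounds
// check below is turned into a lane mask transparently, where explicit
// SIMD code would have to construct and apply a mask by hand.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
```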
The lockstep execution of the threads in a warp gives certain desirable guarantees about the behavior of the threads relative to each other. These guarantees are exploited by warp-synchronous programming techniques, but they are frequently misunderstood by less experienced CUDA programmers: many are weaker than they are perceived to be, and they can be difficult to correlate with HLL code. This then causes unexpected program behavior, which may not be immediately obvious, compounding the problem: the affected code seems to work perfectly in some circumstances but not in others.
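A common warp-synchronous technique is a warp-level reduction via shuffles. The sketch below assumes all 32 lanes of the warp participate; note that since CUDA 9 the `*_sync` primitives require an explicit participation mask precisely because implicit lockstep behavior is a weaker guarantee than it appears:

```cuda
// Warp-level sum reduction using shuffle intrinsics. The explicit mask
// argument (here: all 32 lanes) replaces the old implicit assumption
// that every lane of the warp reaches this code in lockstep.
__global__ void warpSum(const int *in, int *out)
{
    int v = in[threadIdx.x];
    unsigned mask = 0xffffffffu;   // assumption: full warp participates
    for (int offset = 16; offset > 0; offset /= 2)
        v += __shfl_down_sync(mask, v, offset);
    if (threadIdx.x % 32 == 0)     // lane 0 holds the warp's sum
        out[threadIdx.x / 32] = v;
}
```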