Thread divergence improvement techniques

Hi everyone,
I know that when a data-dependent branch occurs within a warp, threads diverge and the so-called SIMD efficiency decreases. I have seen many papers proposing ways to improve this, such as the post-dominator technique with stack-based re-convergence, dynamic warp formation, the large warp architecture, thread block compaction, and so on. A small example of the kind of branch I mean is below.
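
Just to make sure we are talking about the same thing, here is a minimal sketch of a data-dependent branch inside a kernel (the kernel name and the arithmetic are placeholders of my own, not taken from any of those papers): threads in the same warp whose input is negative take one path while the rest take the other, so the warp executes both paths serially with some lanes masked off each time.

// Minimal sketch: a data-dependent branch that causes divergence within a warp.
__global__ void divergent_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (in[i] < 0.0f)           // condition depends on per-thread data
            out[i] = -in[i] * 2.0f; // path A: only some lanes of the warp are active
        else
            out[i] = in[i] + 1.0f;  // path B: the remaining lanes are active
    }
}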
Is any of this actually implemented in hardware?
How do the NVIDIA Fermi and Kepler architectures handle this? I have read the whitepapers on these architectures but did not find the exact information. Can anyone please help me?