So a warp can be executed in 4 clock cycles, assuming it is not divergent, correct? With 8 cores and 1 thread per core, the 32 threads of a warp would need 32 / 8 = 4 clock cycles.
Say there are four divergent points in the execution of the warp, but the disabled threads for any execution path are sparsely distributed throughout the 32 threads.
So no matter what, as soon as the code diverges, the effective number of clock cycles doubles, correct? That is because the hardware/drivers have no way to reorganize the enabled threads. Plus, I believe a single instruction is loaded and executed over all enabled threads (which is where we get the speed from).
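
To make the reasoning above concrete, here is a toy host-side model of it, not a hardware simulator. It assumes the figures from the question (32 threads per warp, 8 cores) and assumes that each distinct branch path is issued once for the whole warp with the other lanes masked off. The function name `cycles_for_instruction` and the `path_of_thread` labels are made up for illustration.

```cpp
#include <set>
#include <vector>

// Assumed figures taken from the question, not queried from any device.
constexpr int kWarpSize = 32;
constexpr int kCores = 8;
constexpr int kCyclesPerWarpInstruction = kWarpSize / kCores;  // 4

// path_of_thread[i] is a hypothetical label for the branch thread i takes.
// Under the stated assumption, every distinct path costs one full warp
// issue, regardless of how many threads are enabled on that path or where
// they sit within the warp.
int cycles_for_instruction(const std::vector<int>& path_of_thread) {
    std::set<int> distinct_paths(path_of_thread.begin(), path_of_thread.end());
    return static_cast<int>(distinct_paths.size()) * kCyclesPerWarpInstruction;
}
```

In this model a uniform warp costs 4 cycles per instruction, and a two-way split costs 8 whether one thread or sixteen threads take the second path, which matches the "double" intuition. A warp that splits four ways would cost 16 cycles for the divergent instructions.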
Is there any way to improve this type of divergent code besides attempting to organize the data that diverges?
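
For reference, the "organize the data" baseline mentioned in the question can be sketched on the host roughly like this. It is only an illustration under assumptions: `branch_of` is a hypothetical per-element label for which branch that element would take, warps are taken to be groups of 32 consecutive elements, and both function names are made up.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Counts warps (groups of 32 consecutive elements) whose elements do not
// all take the same branch. branch_of[i] is a hypothetical branch label.
int count_divergent_warps(const std::vector<int>& branch_of) {
    const std::size_t warp_size = 32;
    int divergent = 0;
    for (std::size_t start = 0; start < branch_of.size(); start += warp_size) {
        const std::size_t end = std::min(start + warp_size, branch_of.size());
        for (std::size_t i = start + 1; i < end; ++i) {
            if (branch_of[i] != branch_of[start]) {
                ++divergent;
                break;  // one mismatch is enough to mark this warp
            }
        }
    }
    return divergent;
}

// Reorders the labels so elements taking the same branch are contiguous.
// A real program would reorder the data (or an index array) alongside.
std::vector<int> group_by_branch(std::vector<int> branch_of) {
    std::stable_sort(branch_of.begin(), branch_of.end());
    return branch_of;
}
```

With 64 elements alternating between two branches, both warps diverge; after grouping, each warp is uniform. Whether the reordering pays for itself depends on how expensive the divergent section is relative to the sort.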