Performance cost due to for-loop divergence



When the loop within a thread ends depends on the data that the thread accesses.

Will the divergence seriously affect performance? From my point of view it won't, because in this kind of divergence some threads stop while the others continue; these are not exactly "two" branches.

Am I right?
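A minimal sketch of the situation being described, where each thread's loop trip count comes from its own data (the kernel and array names here are hypothetical, just for illustration):

```cuda
// Each thread loops until its own data element says to stop, so threads in
// the same warp may exit the loop at different iterations (divergence).
__global__ void dataDependentLoop(const int *iters, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    float acc = 0.0f;
    // Trip count depends on per-thread data: a thread with a small
    // iters[tid] finishes early and then sits idle while the rest of
    // its warp keeps iterating.
    for (int i = 0; i < iters[tid]; ++i)
        acc += 1.0f;  // placeholder work

    out[tid] = acc;
}
```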

The stopped threads waste computational resources by leaving some GPU hardware (ALUs, etc.) unused. That is, unless all 32 threads of a warp have stopped: in that case there is no performance impact, because the entire warp is no longer being scheduled.
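One way to see the "entire warp stopped" case in code: the warp leaves the loop only when every lane's exit condition holds. A hedged sketch (kernel and array names are assumptions) that makes the loop exit warp-uniform with the `__all_sync` vote intrinsic, so all lanes leave together, might look like:

```cuda
// The loop condition is evaluated per lane, but the warp-wide vote
// __all_sync makes the break itself uniform: lanes that are already done
// simply skip the body until every active lane in the warp is done.
__global__ void warpUniformExit(const int *iters, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    float acc = 0.0f;
    int i = 0;
    while (true) {
        bool done = (i >= iters[tid]);
        // Exit only when all currently active lanes are finished.
        if (__all_sync(__activemask(), done)) break;
        if (!done) acc += 1.0f;  // placeholder work for unfinished lanes
        ++i;
    }
    out[tid] = acc;
}
```

Note that this does not by itself recover throughput: the finished lanes still occupy the warp and do nothing until the slowest lane finishes, which is exactly the waste described above.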

Even though it wastes computational resources on the stopped threads, this case will not be worse than having all threads running, right?
I am asking about the time it takes for the threads to finish, not the total throughput.

When only 50% of your threads are currently doing work, and those active threads are dispersed randomly across all warps, your instantaneous throughput cannot exceed 50% of peak.