While loops in kernels, question about thread divergence

What’s the impact on thread divergence if a while loop is used inside the kernel? Say one kernel needs 500 iterations and another kernel needs 800 iterations. Will the 500-iteration kernel just sit idle while the other 300 iterations happen, or are we talking 500+800=1300 iterations of total time? (And of course it’s not just two; there are all the other kernels as well.) How does thread divergence work with while loops?

When you said “one kernel needs 500 iterations”, you probably meant “one thread … another thread …”.

No, thread divergence happens at the instruction level, not at the loop level. If a while loop executes the same instruction sequence on all threads within a warp, there is no divergence. It only diverges when one thread tries to execute one instruction while other threads in the warp need to execute a different instruction (different instructions, not different data).
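To illustrate, here is a minimal made-up sketch (not mcx code; the kernel name, the iters array and the dummy arithmetic are all hypothetical): every thread runs the exact same loop body, only the data-dependent trip count differs, so lanes that finish early are masked off until the slowest lane of their own warp exits the loop.

```cuda
// Hypothetical example: per-thread trip counts come from data, so lanes of
// a warp may leave the while loop at different times.
__global__ void divergentLoop(const int *iters, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int n   = iters[tid];            // e.g. 500 for one thread, 800 for another
    float x = 1.0f;

    int i = 0;
    while (i < n) {                  // identical instructions, different trip counts
        x = x * 1.000001f + 1.0f;    // dummy work
        ++i;
    }
    // A lane that exits early is simply masked off until every lane of its
    // own warp has exited, so the warp runs for max(n over its 32 lanes)
    // iterations, not for the sum of them.
    out[tid] = x;
}
```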

If conditions inside the loop allow one thread to run 500 repetitions while another thread in the same warp runs 800 repetitions, then yes, the threads that finish early will idle for at least 300 iterations. What you can do is add another level of loop outside the while loop, so that each thread runs the jobs equivalent to multiple threads in the original job partition; this way, the uneven job loads are somewhat averaged out (a sketch follows below). Another way is to keep a counter in shared memory recording the remaining iterations/passes, and dynamically (atomically) decrease this counter whenever a thread claims one of the remaining jobs. This allows the load to be distributed more evenly.
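A rough sketch of the first suggestion (the extra outer loop); niter[], jobsperthread and the dummy inner work are placeholders for illustration, not real code from any project. The counter-based second suggestion is sketched further down, next to the 2018 JBO paper.

```cuda
// Each thread works through several jobs instead of one, so short and long
// jobs tend to average out within a warp.
__global__ void bundledJobs(const int *niter, float *out,
                            int jobsperthread, int totaljobs)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;

    for (int k = 0; k < jobsperthread; k++) {   // added outer loop
        int job = tid * jobsperthread + k;      // static partition of the jobs
        if (job >= totaljobs)
            break;

        float x = 1.0f;                         // the original data-dependent loop,
        for (int i = 0; i < niter[job]; i++)    // one job at a time
            x = x * 1.000001f + 1.0f;           // dummy work
        acc += x;
    }
    out[tid] = acc;
    // idling now depends on the difference between the *sums* of trip counts
    // per thread, which is usually relatively smaller than the difference
    // between two individual jobs
}
```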

First para: yes, I used the wrong terminology. When I think “loop” I think “code”, and thus I used the word “kernel” (because that is code). You are right, I’m talking about threads.

Third para: could you point me to some code references or papers? I don’t grok this at all.

500 and 800 were an extreme example to make my point; in real life my problems will only differ by maybe 1-2 loop iterations. And it helps to know that thread divergence is limited to the warp. So if the number of loop iterations were, for example, the thread id (silly example), then the “damage” would be confined to each group of 32 threads, i.e. the “thread divergence counter” effectively starts at 0 again for the next warp.


Please check out two of our papers:
https://www.osapublishing.org/oe/fulltext.cfm?uri=oe-17-22-20178&id=187243

https://www.spiedigitallibrary.org/journals/journal-of-biomedical-optics/volume-23/issue-01/010504/Scalable-and-massively-parallel-Monte-Carlo-photon-transport-simulations-for/10.1117/1.JBO.23.1.010504.full?SSO=1#ArticleLink

Fig. 1 in the first paper shows the loop structure I used in my code, located in this unit.

This code (mcx) performs Monte Carlo simulations of photons. Each photon moves sequentially, in incremental steps, through a voxelated grid; for each step, the kernel performs ray-tracing to determine which voxel boundary the photon hits, how far it travels, how much its energy attenuates, and, if it arrives at the end of the current path segment, what the next scattering angle/length are.

The kernel is structured so that each thread simulates not a single photon, but a group of gcfg->threadphoton photons. Some photons navigate the domain for thousands of steps while others move only a dozen steps, but if you bundle the jobs, the load imbalance between threads is much smaller (see the sketch below).
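Schematically, the loop structure looks something like the sketch below. This is a simplified outline, not the actual mcx kernel: threadphoton mirrors gcfg->threadphoton, and simulateOneStep() is a stub standing in for the ray-tracing/attenuation/scattering of one step.

```cuda
// Stand-in for one propagation step (ray-trace to the next voxel boundary,
// attenuate the weight, sample the next scattering angle/length). Here it
// just makes each photon "die" after a pseudo-random number of steps so the
// data-dependent trip count is visible.
__device__ int simulateOneStep(int tid, int photonid, int step)
{
    unsigned h = (unsigned)tid * 73856093u ^ (unsigned)photonid * 19349663u;
    return step < (int)(h % 1000u);          // nonzero while the photon is alive
}

// Each thread simulates a whole bundle of photons, one after another.
__global__ void photonKernel(int threadphoton)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    for (int p = 0; p < threadphoton; p++) {     // bundle of photons per thread
        int step = 0;
        while (simulateOneStep(tid, p, step))    // thousands of steps for some
            step++;                              // photons, a dozen for others
    }
    // bundling many photons per thread averages the per-photon step counts,
    // so the load imbalance between the 32 threads of a warp shrinks
}
```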

In the 2nd paper above (2018, JBO), we explored the idea of dynamic load balancing: the total number of photons to be simulated is assigned to a block (instead of to each thread) and stored in the shared-memory counter blockphoton[0], which is decremented by 1 each time a thread checks out a photon.
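A hedged sketch of that scheme (the kernel and variable names other than blockphoton[0] are made up, and the photon's random walk is abstracted to a pseudo-random step count): the block's photon budget lives in shared memory, and each thread atomically checks out one photon at a time until the budget runs out.

```cuda
// Block-level dynamic load balancing: the remaining photon budget is shared
// by all threads of the block and claimed one photon at a time.
__global__ void dynamicPhotonKernel(int photonsPerBlock, float *stepsPerThread)
{
    __shared__ int blockphoton[1];              // remaining photons for this block

    if (threadIdx.x == 0)
        blockphoton[0] = photonsPerBlock;       // budget assigned to the block
    __syncthreads();

    int totalSteps = 0;
    while (true) {
        // atomicSub returns the value *before* the decrement, so a value of
        // zero or less means the block's budget is already exhausted
        int remaining = atomicSub(&blockphoton[0], 1);
        if (remaining <= 0)
            break;

        int photonid = photonsPerBlock - remaining;   // index of the claimed photon
        // the photon's random walk (ray-trace, attenuate, scatter) is
        // abstracted here as a pseudo-random, data-dependent step count
        unsigned h = (unsigned)threadIdx.x * 73856093u
                   ^ (unsigned)photonid * 19349663u;
        totalSteps += (int)(h % 1000u);
    }
    stepsPerThread[blockIdx.x * blockDim.x + threadIdx.x] = (float)totalSteps;
    // fast threads simply claim more photons, so the block finishes when the
    // shared budget runs out, not when the slowest statically assigned
    // thread finishes
}
```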


The results in Fig. 3a of that paper show that this group-level load balancing adds some small benefit on top of the thread-level load balancing.