This is unspecified by CUDA and may vary from GPU architecture to GPU architecture, CUDA version to CUDA version, etc.
If you want to get some granular understanding here, it’s necessary to look at the SASS code generated by the compiler. There are various examples of this in questions on various forums.
But even if you arrive at a conclusion, that observation may not be universal.
Particularly in the Volta (and newer) execution model, with its independent thread scheduling, all bets are off. The proper thought process is:
- Threads can execute in any order.
- Say item 1 again.
- Answer any questions with item 1 as the answer.
- If your code requires anything else, and you haven’t provided specifically for it (via execution barriers, etc.), your code is broken.
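As an illustration of that last point, here is a minimal sketch (hypothetical kernels, names invented for this example) of code that is broken because it assumes intra-warp ordering, alongside a version that provides for the ordering explicitly with `__syncwarp()`:

```cuda
// BROKEN: assumes lane 0 has executed its store before the
// other lanes of the warp execute their loads. No such intra-warp
// ordering is guaranteed.
__global__ void broken(int *out) {
    __shared__ int s;
    unsigned lane = threadIdx.x % 32;
    if (lane == 0) s = 42;
    out[threadIdx.x] = s;       // may read a stale value
}

// FIXED: __syncwarp() is an execution and memory barrier for the
// participating threads of the warp.
__global__ void fixed(int *out) {
    __shared__ int s;
    unsigned lane = threadIdx.x % 32;
    if (lane == 0) s = 42;
    __syncwarp();               // all lanes now observe the store
    out[threadIdx.x] = s;
}
```

The broken version may happen to work on some architecture/compiler combinations, which is exactly why this class of bug is dangerous.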
Does the warp execute to the point where all threads in the warp exit the while loop, so that their program counters point to the same instruction?
Not necessarily. Threads can execute in any order. The execution engine is free to take a single thread and schedule it over and over again. (You might ask, "Well, is anything guaranteed to break the cycle?" Yes. A partial list: exited threads are not eligible for scheduling; ineligible threads (e.g. stalled ones) are not eligible for scheduling; and threads waiting at an explicit execution barrier are not eligible for scheduling until the barrier is satisfied.)
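A classic consequence of this scheduling freedom is the intra-warp spin lock. In the pre-Volta model, where a warp shared one program counter, the losing lanes could spin forever while the lane holding the lock was never advanced to the unlock. A hypothetical sketch:

```cuda
// Hypothetical kernel: every thread acquires a global spin lock.
// Pre-Volta, this could livelock: the scheduler may keep re-running
// the spinning lanes of a warp while the winning lane never reaches
// atomicExch(). Volta's independent thread scheduling makes forward
// progress possible here, but the order of execution is still
// unspecified.
__global__ void spinlock_demo(int *lock, int *counter) {
    while (atomicCAS(lock, 0, 1) != 0) { }  // spin until acquired
    *counter += 1;                          // critical section
    __threadfence();                        // make the update visible
    atomicExch(lock, 0);                    // release
}
```

This pattern is shown to illustrate the scheduling hazard, not as a recommended design; per-thread global locks are rarely the right tool.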
How does CUDA handle branch divergence?
Threads can execute in any order.
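For concreteness, in a divergent branch like the following (a hypothetical kernel), CUDA specifies only what each thread computes, not the order in which the two paths execute or the point at which the warp reconverges:

```cuda
__global__ void divergent(int *out) {
    if (threadIdx.x % 2 == 0) {
        out[threadIdx.x] = 1;   // path taken by even lanes
    } else {
        out[threadIdx.x] = 2;   // path taken by odd lanes
    }
    // Whether the even or odd lanes run first, and where (or whether)
    // the warp reconverges, is unspecified. Only the per-thread
    // results are guaranteed.
}
```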
My guess is that the condition for a warp to exit is that all program counters point to the same place, although this requires confirmation.
No. A warp doesn’t “exit”. A thread does. Threads can execute in any order.
Any expectation that CUDA will synchronize something for you, when you have not explicitly provided for synchronization, is a dangerous and broken thought process.
Yes, frequently warps do execute in lockstep. This is for performance reasons, not based on any requirement or expectation. The CUDA compiler and execution engine may seek to schedule things in a way that allows for the earliest possible reconvergence of the warp, for performance reasons. But there is no requirement or specification provided by CUDA to do so.
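This is why the warp-level primitives all take an explicit member mask in their `_sync` variants: when you actually need the warp converged, you state it, rather than relying on lockstep behavior. A hedged sketch (kernel name and shapes invented for this example) of a per-warp reduction:

```cuda
// Hypothetical per-warp sum: blockDim.x assumed to be a multiple of 32.
__global__ void warp_sum(const int *in, int *out) {
    int v = in[threadIdx.x];
    if (v < 0) v = 0;           // possibly divergent branch

    // The full mask names all 32 lanes; each __shfl_down_sync call
    // synchronizes the named lanes before exchanging data, so
    // correctness does not depend on lockstep execution.
    for (int offset = 16; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffff, v, offset);

    if (threadIdx.x % 32 == 0)  // lane 0 of each warp holds the sum
        out[threadIdx.x / 32] = v;
}
```

The explicit mask is the specification-level replacement for the implicit lockstep assumption that older code sometimes relied on.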