How does CUDA synchronize a while loop in a kernel?

I just happened to see the following code and started to think about how CUDA synchronizes a while loop:

__device__ int atomicAdd(int *address, int val)
{
    int old = *address, assumed;
    do
    {
        assumed = old;
        old = atomicCAS(address, assumed, val + assumed);
    }
    while (assumed != old);
    return old;
}
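The same compare-and-swap retry pattern generalizes to types the hardware doesn't support directly on older architectures; the CUDA Programming Guide sketches a double-precision atomicAdd the same way (renamed here to avoid colliding with the built-in on newer GPUs):

```cuda
// Same CAS retry loop, but for double: reinterpret the bits as
// unsigned long long so atomicCAS can compare them.
__device__ double atomicAddDouble(double *address, double val)
{
    unsigned long long *address_as_ull = (unsigned long long *)address;
    unsigned long long old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old); // retry if another thread changed *address meanwhile
    return __longlong_as_double(old);
}
```

Note that the loop exits per thread: each thread leaves as soon as its own atomicCAS succeeds, regardless of what the other lanes of its warp are doing.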

When a warp reaches the comparison assumed != old, there is at least one thread — the one whose atomicCAS succeeded — that doesn't satisfy the condition, so its program counter moves on to return old (probably ending in a pop instruction), while the other threads are directed back to assumed = old. Does the warp keep executing until all of its threads have exited the while loop, so that their program counters all point to the same instruction?

This question may go back to a basic question: how does CUDA treat branch divergence? My guess is that the condition for a warp to exit the loop is that all program counters point to the same place, although this requires confirmation.

This is unspecified by CUDA and may vary from GPU architecture to GPU architecture, CUDA version to CUDA version, etc.

If you want to get a granular understanding here, it's necessary to look at the SASS code generated by the compiler (e.g. via cuobjdump --dump-sass). There are various examples of this in questions on various forums.

But even if you arrive at a conclusion, that observation may not be universal.

Particularly in the Volta execution model, all bets are off. The proper thought process is:

  1. Threads can execute in any order.
  2. Say item 1 again.
  3. Answer any questions with item 1 as the answer.
  4. If your code requires anything else, and you haven’t provided specifically for it (via execution barriers, etc.), your code is broken.
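As a concrete sketch of item 4: if one lane reads a shared-memory location that another lane of the same warp wrote, an explicit barrier such as __syncwarp() is required. A hypothetical kernel, assuming a single warp of 32 threads per block:

```cuda
__global__ void neighborRead(int *out)
{
    __shared__ int s[32];
    unsigned lane = threadIdx.x % 32;
    s[lane] = (int)lane;
    __syncwarp();  // without this, nothing guarantees the neighboring
                   // lane's store has happened yet (no implicit lockstep)
    out[threadIdx.x] = s[(lane + 1) % 32];
}
```

With more than one warp per block, __syncthreads() would be the appropriate barrier instead.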

Does the warp execute to the point where all threads in the warp exit the while loop so that the program counters point to the same instruction?

Not necessarily. Threads can execute in any order. The execution engine is free to take a single thread, and schedule it over and over and over again. (You might ask, "well, is anything guaranteed to break the cycle?" Yes. A partial list would be: exited threads are not eligible for scheduling, ineligible threads (e.g. stalled) are not eligible for scheduling, and threads waiting at an explicit execution barrier are not eligible for scheduling – until the barrier is satisfied.)
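A classic illustration of why this matters: a spin lock contended by multiple lanes of the same warp. Under the pre-Volta model, the scheduler may keep re-running the spinning lanes while the lane that holds the lock never advances, so the warp can hang; Volta's independent thread scheduling provides the forward-progress behavior that makes this pattern workable. A hedged sketch:

```cuda
// Risky pattern: a spin lock taken by every lane of a warp.
// Pre-Volta, the spinning lanes can starve the lock holder forever.
__device__ void criticalSection(int *lock, int *counter)
{
    while (atomicCAS(lock, 0, 1) != 0)
        ;                      // spin until we acquire the lock
    *counter += 1;             // critical section
    __threadfence();           // make the update visible before release
    atomicExch(lock, 0);       // release
}
```

The atomicCAS loop in the question does not have this hazard, because every thread's CAS eventually succeeds regardless of scheduling order.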

how does CUDA treat branch divergence?

Threads can execute in any order.

My guess is that the condition for a warp to exit is that all program pointers point to the same place, although this requires confirmation.

No. A warp doesn’t “exit”. A thread does. Threads can execute in any order.

Any expectation that you have, that CUDA will synchronize something for you, when you have not explicitly provided for synchronization, is a dangerous and broken thought process.

Yes, frequently, warps execute in lockstep. This is for performance reasons, not based on any requirement or expectation. The CUDA compiler and execution engine may seek to schedule things in a way that allows for the earliest possible reconvergence of the warp, for performance reasons. But there is no requirement or specification provided by CUDA to do so.
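This is also why the warp-level primitives take an explicit member mask: the _sync intrinsics name the lanes that must participate, rather than relying on the warp having implicitly reconverged. A small sketch of exchanging values after a divergent branch:

```cuda
__global__ void divergentShuffle(int *out)
{
    unsigned lane = threadIdx.x % 32;
    int v = (lane < 16) ? (int)lane * 2 : (int)lane * 3;  // divergent branch
    // 0xffffffff: all 32 lanes must (re)converge here before the exchange
    int fromNext = __shfl_sync(0xffffffffu, v, (lane + 1) % 32);
    out[threadIdx.x] = fromNext;
}
```

The mask makes the convergence requirement explicit in the program, instead of leaving it as an (unwarranted) assumption about scheduling.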

I have long had a question. Is it helpful to read PTX code to understand execution? What is the biggest difference between PTX and SASS?

What is the biggest difference between PTX and SASS?

PTX is an intermediate “code” that imagines a virtual machine model. PTX must be compiled to SASS before it can actually execute on a GPU. Furthermore, the tool that converts PTX to SASS is an optimizing compiler. There may not be much resemblance between the PTX (input) and the SASS (output). For these reasons, I don’t generally recommend that people depend on PTX for any kind of machine (behavior, performance) analysis.

Is it helpful to read PTX code to understand execution?

You can certainly learn things about the virtual architecture and general behavior from learning PTX; there are many important concepts that can be picked up this way.

And although there is no one-to-one correspondence between PTX instructions, as outlined in the PTX ISA doc, and the generated SASS instructions, one can gain a decent understanding of what is happening.

As Robert mentions, particularly at higher nvcc optimisation levels, the SASS can become very disjointed from a flow perspective.


Good point. Because there is no in-depth documentation for SASS, if you want to learn about what SASS instructions do, the PTX docs are the next best alternative.
