Since the Volta architecture, each individual thread has its own program counter and call stack, which allows the warp scheduler to interleave the execution of different branches in divergent code. https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
But can it reconverge threads from different branches that want to execute the same instruction, or only threads that share the same program counter value? In the example code below, could A and B be executed in the same cycle (assuming the compiler does not optimize this code away)?
float v = 1;
if (threadIdx.x % 2) {
v += 1; // A
} else {
v += 1; // B
}
Would that be different for function calls? If the function is not inlined, then diverging threads that call the same function should have the same program counter value once inside the function, so in theory Independent Thread Scheduling could reconverge them?
void f() { ... }
if (threadIdx.x % 2) {
f(); // A
} else {
f(); // B
}
I tried testing this in Vulkan by measuring clocks, but the results are inconclusive. Do you have more info on this?
Separate (compiled) instructions cannot be executed in the same cycle. There is an exception of sorts for certain instructions such as warp shuffle (the synchronous variant), but it does not apply to instructions generally.
Reconvergence is generally accomplished by the compiler inserting specific instructions (sync barriers) into the code.
Having said that, I don’t know what the compiler would do with the code you have shown. I’m not certain about it; it’s possible it might observe the similarity between the conditional branches and remove the conditional aspect entirely (probably via predication in this case). I’ve not looked into that.
Thanks for the quick answer.
So what would happen if you inserted a __syncwarp() in both branches?
float v = 1;
if (threadIdx.x % 2) {
__syncwarp();
v += 1;
} else {
__syncwarp();
v += 1;
}
Or at the beginning of the called function in the second example?
void f() {
__syncwarp();
...
}
if (threadIdx.x % 2) {
f();
} else {
f();
}
Illegal code, undefined behavior. This is described in the programming guide.
Well, the CUDA programming guide doesn’t actually say anything for compute capabilities > sm_6x:
For .target sm_6x or below, all threads in mask must execute the same __syncwarp() in convergence, and the union of all values in mask must be equal to the active mask. Otherwise, the behavior is undefined.
Yes, correct. For ITS-enabled architectures, the __syncwarp() instruction itself is one of those special cases: its mask can be satisfied (i.e. without triggering UB) by instructions on separate paths in specific cases (when the mask is the same between paths), somewhat like warp shuffle.
That does not necessarily mean that subsequent instructions on those separate paths will be issued in convergence. It does mean that the barrier synchronization effect of __syncwarp() is legal and “applied” by the time subsequent instructions begin to execute (in convergence or not).
If the encountered masks are not the same, then the behavior is still illegal, but that does not apply to the example you have shown.
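To make the mask cases concrete, here is a sketch (my own illustration, assuming a full 32-thread warp with no divergence before the if):

```cuda
__global__ void maskExample() {
    // Legal per the above: the same full mask on both paths, so the two
    // barriers can satisfy each other under ITS.
    if (threadIdx.x % 2) { __syncwarp(0xFFFFFFFFu); }
    else                 { __syncwarp(0xFFFFFFFFu); }

    // Also legal, but different semantics: each path names only the lanes
    // that actually take it, so no cross-path matching is involved.
    if (threadIdx.x % 2) { __syncwarp(0xAAAAAAAAu); }  // odd lanes sync among themselves
    else                 { __syncwarp(0x55555555u); }  // even lanes sync among themselves

    // Illegal (UB): masks that disagree about which barrier a lane
    // participates in, e.g. the odd path naming the full warp while the
    // even path names only its own lanes.
}
```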
Speaking for myself (IMHO, without inside knowledge, and _not_ working at NVIDIA), I would expect any such convergence (e.g. on the same called function) either not to work (no reconvergence) at all, or to work only by accident and for a short duration. I would not trust it for any performance optimization.
Some convergence checks could perhaps accept those cases, provided correct functionality is still guaranteed.
But even if it works today, it could stop working with the next compiler or driver revision, or from one hardware generation to the next.
Rewrite your algorithm to explicitly converge and execute the same lines of code: move the increment or the f() call outside the if statement, repeat the if statement afterwards, etc. You can combine that approach with __syncwarp().
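As a sketch of that restructuring, based on the examples earlier in the thread (the per-lane work placeholders are hypothetical):

```cuda
__device__ void f() { /* ... */ }

__global__ void restructured() {
    float v = 1.0f;
    bool odd = threadIdx.x % 2;
    // Hoist the common work out of the if/else so every lane executes
    // the same instructions in convergence:
    v += 1.0f;   // was duplicated in both branches
    f();         // was called from both branches
    // If genuinely per-branch work remains, repeat the condition afterwards:
    if (odd) {
        // odd-lane-only work (placeholder)
    } else {
        // even-lane-only work (placeholder)
    }
}
```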
Code at the cutting edge of what works is my specialty ^^
This is for theoretical research, not production anything.
OK, I tried it in CUDA, and the warp does synchronize if there is a true function call and the called function starts with a __syncwarp()! clock64() returns the same time, and __activemask() even reports that all threads are active. It does not work without the __noinline__ or the __syncwarp().
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

__noinline__
__device__
void func(int64_t* clock, uint32_t* mask) {
    __syncwarp();            // full-mask barrier at function entry
    *clock = clock64();
    *mask = __activemask();
}

__global__
void reconvergenceTest(bool* stop) {
    int64_t clock;
    uint32_t mask;
    if (threadIdx.x % 2) {
        func(&clock, &mask); // A
    } else {
        func(&clock, &mask); // B
    }
    int pred;
    // pred is nonzero iff all 32 lanes read the same clock value
    __match_all_sync(0xFFFFFFFFu, clock, &pred);
    if (threadIdx.x == 0) {
        printf("%s, clock %lli, mask %08x\n",
            pred ? "same" : "different",
            clock,
            mask
        );
        if (pred) {
            *stop = true;
        }
    }
}

int main(int argc, char** argv) {
    bool* stop;
    cudaHostAlloc(&stop, sizeof(*stop), cudaHostAllocMapped); // was sizeof(stop)
    *stop = false;
    for (int i = 0; !*stop; i++) {
        reconvergenceTest<<<1, 32>>>(stop);
        cudaDeviceSynchronize(); // make the device's write to *stop visible
    }
    return 0;
}
It does not look like the compiler removed the branch: the PTX clearly contains it, and while I’m not good at reading SASS, Claude’s decompilation attempt indicated the same. I’m not advocating for this style of code, but it’s interesting that it can work at least a little bit :)
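For anyone wanting to reproduce that inspection step, the intermediate code can be dumped with the standard CUDA toolchain (the filename reconvergence.cu is my assumption, and BSSY/BSYNC are the Volta-era SASS reconvergence barriers to look for):

```shell
# Compile to PTX and look for the branch around the two calls
nvcc -arch=sm_70 -ptx reconvergence.cu -o reconvergence.ptx
grep -n "bra" reconvergence.ptx

# Compile and disassemble the SASS to check for sync barriers and the call
nvcc -arch=sm_70 reconvergence.cu -o reconvergence
cuobjdump -sass reconvergence | grep -nE "BSSY|BSYNC|CALL"
```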