Since the Volta architecture, each individual thread has its own program counter and call stack, which allows the warp scheduler to interleave the execution of different branches in divergent code. https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
But can it reconverge threads from different branches that want to execute the same instruction, or only threads that share the same program counter value? In the example code below, could A and B be executed in the same cycle (assuming the compiler does not optimize this code away)?
float v = 1;
if (threadIdx.x % 2) {
v += 1; // A
} else {
v += 1; // B
}
Would that be different for function calls? If the function is not inlined, then diverging threads that call the same function should have the same program counter value once inside the function, so in theory Independent Thread Scheduling could reconverge them?
void f() { ... }
if (threadIdx.x % 2) {
f(); // A
} else {
f(); // B
}
I tried testing this in Vulkan by measuring clocks, but the results are inconclusive. Do you have more info on this?
Separate (compiled) instructions cannot be executed in the same cycle. There is an exception of sorts for certain instructions such as warp shuffle (the synchronous variant), but it does not apply to instructions generally.
Reconvergence is generally accomplished by the compiler inserting specific instructions (sync barriers) into the code.
Having said that, I don’t know what the compiler would do with the code you have shown. I’m not certain about it; it’s possible it might observe the similarity between the conditional branches and remove the conditional aspect entirely (probably via predication in this case). I’ve not looked into that.
Thanks for the quick answer.
So what would happen if you inserted a __syncwarp() in both branches?
float v = 1;
if (threadIdx.x % 2) {
__syncwarp();
v += 1;
} else {
__syncwarp();
v += 1;
}
Or at the beginning of the called function in the second example?
void f() {
__syncwarp();
...
}
if (threadIdx.x % 2) {
f();
} else {
f();
}
Illegal code, undefined behavior. This is described in the programming guide.
Well, the CUDA programming guide doesn’t actually say anything for compute capabilities > sm_6x:
For .target sm_6x or below, all threads in mask must execute the same __syncwarp() in convergence, and the union of all values in mask must be equal to the active mask. Otherwise, the behavior is undefined.
Yes, correct. For ITS-enabled architectures, the __syncwarp() instruction itself is one of those special cases: its mask can be satisfied (i.e. without triggering UB) by instructions on separate paths in specific cases (when the mask is the same between paths), somewhat like warp shuffle.
That does not necessarily mean that subsequent instructions on those separate paths will be issued in convergence. It does mean that the barrier synchronization effect of __syncwarp() is legal and “applied” by the time subsequent instructions begin to execute (in convergence or not).
If the encountered masks are not the same, then the behavior is still illegal, but that does not apply to the example you have shown.
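To make the mask cases concrete, here is a sketch (my own illustration, assuming a full 32-thread warp with no divergence before the if):

```cuda
__global__ void maskExample() {
    // Legal per the above: the same full mask on both paths, so the two
    // barriers can satisfy each other under ITS.
    if (threadIdx.x % 2) { __syncwarp(0xFFFFFFFFu); }
    else                 { __syncwarp(0xFFFFFFFFu); }

    // Also legal, but different semantics: each path names only the lanes
    // that actually take it, so no cross-path matching is involved.
    if (threadIdx.x % 2) { __syncwarp(0xAAAAAAAAu); }  // odd lanes sync among themselves
    else                 { __syncwarp(0x55555555u); }  // even lanes sync among themselves

    // Illegal (UB): masks that disagree about which barrier a lane
    // participates in, e.g. the odd path naming the full warp while the
    // even path names only its own lanes.
}
```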
Speaking for myself (IMHO, without inside knowledge, and _not_ working at NVIDIA), I would expect any such convergence (e.g. on the same called function) either not to work (no reconvergence) at all, or to work only by accident and for a short duration. I would not trust it for any performance optimization.
Some convergence checks could perhaps accept those cases, provided correct functionality is still guaranteed.
But even if it works today, it could stop working with the next compiler or driver revision, or from one hardware generation to the next.
Rewrite your algorithm to explicitly converge and execute the same lines of code: move the increment or the f() call outside the if statement, repeat the if statement afterwards, etc. You can combine that approach with __syncwarp().
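As a sketch of that restructuring, based on the examples earlier in the thread (the per-lane work placeholders are hypothetical):

```cuda
__device__ void f() { /* ... */ }

__global__ void restructured() {
    float v = 1.0f;
    bool odd = threadIdx.x % 2;
    // Hoist the common work out of the if/else so every lane executes
    // the same instructions in convergence:
    v += 1.0f;   // was duplicated in both branches
    f();         // was called from both branches
    // If genuinely per-branch work remains, repeat the condition afterwards:
    if (odd) {
        // odd-lane-only work (placeholder)
    } else {
        // even-lane-only work (placeholder)
    }
}
```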
Code at the cutting edge of what works is my specialty ^^
This is for theoretical research, not production anything.
OK, I tried it in CUDA, and the warp does synchronize if there is a true function call and the called function starts with a __syncwarp()! clock64() returns the same time, and __activemask() even reports that all threads are active. It does not work without the __noinline__ or the __syncwarp().
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

__noinline__
__device__
void func(int64_t* clock, uint32_t* mask) {
    __syncwarp();            // full-mask barrier at function entry
    *clock = clock64();
    *mask = __activemask();
}

__global__
void reconvergenceTest(bool* stop) {
    int64_t clock;
    uint32_t mask;
    if (threadIdx.x % 2) {
        func(&clock, &mask); // A
    } else {
        func(&clock, &mask); // B
    }
    int pred;
    // pred is nonzero iff all 32 lanes read the same clock value
    __match_all_sync(0xFFFFFFFFu, clock, &pred);
    if (threadIdx.x == 0) {
        printf("%s, clock %lli, mask %08x\n",
            pred ? "same" : "different",
            clock,
            mask
        );
        if (pred) {
            *stop = true;
        }
    }
}

int main(int argc, char** argv) {
    bool* stop;
    cudaHostAlloc(&stop, sizeof(*stop), cudaHostAllocMapped); // was sizeof(stop)
    *stop = false;
    for (int i = 0; !*stop; i++) {
        reconvergenceTest<<<1, 32>>>(stop);
        cudaDeviceSynchronize(); // make the device's write to *stop visible
    }
    return 0;
}
It does not look like the compiler removed the branch: the PTX clearly contains it, and while I’m not good at reading SASS, Claude’s decompilation attempt indicated the same. I’m not advocating for this style of code, but it’s interesting that it can work at least a little bit :)
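For anyone wanting to reproduce that inspection step, the intermediate code can be dumped with the standard CUDA toolchain (the filename reconvergence.cu is my assumption, and BSSY/BSYNC are the Volta-era SASS reconvergence barriers to look for):

```shell
# Compile to PTX and look for the branch around the two calls
nvcc -arch=sm_70 -ptx reconvergence.cu -o reconvergence.ptx
grep -n "bra" reconvergence.ptx

# Compile and disassemble the SASS to check for sync barriers and the call
nvcc -arch=sm_70 reconvergence.cu -o reconvergence
cuobjdump -sass reconvergence | grep -nE "BSSY|BSYNC|CALL"
```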