In my CUDA code, some tiled partitions may terminate early based on data values. I wonder what happens internally in a case like this?
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

__device__ void kernel2(cg::thread_block_tile<2> g, int *data) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x, val = data[tid];
    // tile-level reduction: sum the values within this partition
    for (int offset = g.size() / 2; offset > 0; offset /= 2) {
        val += g.shfl_down(val, offset);
    }
    val = g.shfl(val, 0); // broadcast the sum from lane 0 to the whole tile
    // the tiled partition terminates early based on a condition
    if (val > 10) {
        return;
    }
    // continue doing something else ...
}

__global__ void kernel1(int *data) {
    // tiled partition with a group size of 2
    auto g = cg::tiled_partition<2>(cg::this_thread_block());
    kernel2(g, data);
}
This is a simplified example that demonstrates the premise of my CUDA code. Each tiled partition has only two threads; those two threads perform some operations, and the partition exits based on a condition. So some tiled partitions may still be executing while others have already decided to exit.
I understand that CUDA threads execute at the granularity of a warp (32 threads), which corresponds to 16 tiled partitions in this case. If 3 of the 16 tiled partitions decide to exit but the other 13 continue, will CUDA execute this as if those 3 tiled partitions caused branch divergence? In my case, I observe that this seems to cause some nondeterministic behavior.
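To make the question concrete, here is a hypothetical diagnostic (not part of my actual code) that surviving threads could run after the early exits, assuming an early-exit condition similar to the one above. It uses `__activemask()` to report which lanes of the warp are still converged at that point; note that on Volta and later, independent thread scheduling means this mask only reflects the lanes currently converged with the caller, so its value can itself vary between runs:

```cuda
#include <cooperative_groups.h>
#include <cstdio>

namespace cg = cooperative_groups;

__global__ void probe(int *data) {
    auto g = cg::tiled_partition<2>(cg::this_thread_block());
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    // hypothetical stand-in for the per-tile early-exit condition
    if (data[tid] > 10) {
        return; // this tile's two lanes leave the kernel
    }
    // lanes still converged with this thread at this point
    unsigned mask = __activemask();
    if (g.thread_rank() == 0 && (threadIdx.x & 31) < 2) {
        printf("warp %d active mask: 0x%08x\n", tid / 32, mask);
    }
}
```

If, say, tiles 0-2 of a warp exited, one might expect the low six bits of the mask to be clear (each tile is two adjacent lanes), but the printed mask is not guaranteed to be the same on every run, which is consistent with the nondeterminism I am seeing.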