Cooperative group tiled partitions terminate early

In my CUDA code, some tiled partitions may terminate early based on data values. I wonder what happens internally in that case?

namespace cg = cooperative_groups;

__device__ void kernel2(cg::thread_block_tile<2> g, int *data) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x, val = data[tid];

    // tiled-partition-level reduction with shuffles
    for (int offset = g.size() / 2; offset > 0; offset /= 2) {
        val += g.shfl_down(val, offset);
    }
    val = g.shfl(val, 0);

    // tiled partition terminates early based on a condition
    if (val > 10) {
        return;
    }

    // continue to do something else ...
}

__global__ void kernel1(int *data) {
    // tiled partition with group size of 2
    auto g = cg::tiled_partition<2>(cg::this_thread_block());
    kernel2(g, data);
}

This simplified example demonstrates the premise of my CUDA code. Each tiled partition has only two threads: the two threads do some operations, and the tiled partition exits based on a condition. So some tiled partitions may still be executing while others have already decided to exit.

I understand that CUDA threads operate at the granularity of a warp (32 threads), which corresponds to 16 tiled partitions in this case. If 3 of the 16 tiled partitions decide to exit while the other 13 continue, will CUDA execute as if those 3 tiled partitions have taken a divergent branch? In my case, I observe that this seems to cause some nondeterministic behavior.

In general, there is nothing wrong with only a subset of threads returning from a device function. Whether your posted snippet is correct or not depends on the code you have not shown (“do something else”).

Exited threads do not create or cause divergence. The warp scheduler will not replay instructions purely on behalf of exited threads.
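For what it’s worth, the real hazard to watch for is a tile collective (`g.sync()`, `g.shfl()`, …) being reached by some threads of a tile after other threads of the same tile have returned. In your snippet the broadcast via `g.shfl(val, 0)` makes the exit condition uniform across the tile, so both threads of a tile leave together, which is fine. A minimal sketch of the safe vs. unsafe pattern (`tile_worker` and `do_more_work` are hypothetical names, not from your code):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// hypothetical placeholder for the rest of the work
__device__ int do_more_work(int v) { return v * 2; }

__device__ void tile_worker(cg::thread_block_tile<2> g, int *data) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    int val = data[tid];

    // Safe: val is broadcast from rank 0, so every thread in the
    // tile evaluates the same condition and returns together.
    val = g.shfl(val, 0);
    if (val > 10) return;

    // Unsafe: a per-thread condition like the one below could make
    // one thread of the tile return while its partner later reaches
    // the tile collective, which expects all tile members present.
    // if (tid % 2 == 0) return;   // DON'T do this before g.shfl()

    data[tid] = g.shfl(do_more_work(val), 0);  // tile collective
}
```

As long as whole tiles exit together (or no collectives follow a partial exit), the early return is well-defined.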

Thanks for your reply.

So in my example, the SP that executes such a workload will only be 81.25% (26/32) occupied, because the other 6 threads have already exited? I am asking because I have come across a few papers discussing thread regrouping or warp re-formation. Does CUDA do any of those optimizations internally?

I have no idea what that means. In CUDA-speak, an SP is a CUDA core, which is basically a floating-point ALU.

I’ve never heard those statements used with CUDA.

I don’t see how any of this relates to your original question.