Cooperative group tiled partitions terminate early

In my CUDA code, some tiled partitions may terminate early depending on data values. I wonder what happens internally in this case?

namespace cg = cooperative_groups;

__device__ void kernel2(cg::thread_block_tile<2> g, int *data) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x, val = data[tid];

    // tile-level reduction: sum val across the two lanes of this tile
    for (int offset = g.size() / 2; offset > 0; offset /= 2) {
        val += g.shfl_down(val, offset);
    }
    val = g.shfl(val, 0);  // broadcast the sum from lane 0 to both lanes

    // the whole tile terminates early based on a condition; thanks to the
    // broadcast above, both lanes of a tile see the same val and exit together
    if (val > 10) {
        return;
    }

    // continue doing something else ...
}

__global__ void kernel1(int *data) {
    // tiled partition with group size of 2
    auto g = cg::tiled_partition<2>(cg::this_thread_block());
    kernel2(g, data);
}
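
For completeness, a minimal launch would look something like this (the host-side details such as d_data, N, and the block size are placeholders, not part of my real code):

int main() {
    const int N = 256;                  // one int per thread
    int *d_data;
    cudaMalloc(&d_data, N * sizeof(int));
    // ... initialize d_data ...
    kernel1<<<N / 128, 128>>>(d_data);  // block size is a multiple of 2,
                                        // so tiled_partition<2> is valid
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}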

This simplified example demonstrates the premise of my CUDA code. Each tiled partition has only two threads; the two threads do some work, and the tiled partition exits based on a condition. So some tiled partitions can still be executing while others have already decided to exit.

I understand that CUDA threads execute at the granularity of a warp (32 threads), which corresponds to 16 tiled partitions here. If 3 of the 16 tiled partitions decide to exit while the other 13 continue, will CUDA execute this as if those 3 tiled partitions had taken a divergent branch? I ask because I observe what seems to be nondeterministic behavior in my case.

In general, there is nothing wrong with only a subset of threads returning from a device function. Whether your posted snippet is correct depends on the code you have not shown (“do something else”).
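
For instance, consider this hypothetical variant (not from your post) in which the early return is per-thread rather than per-tile. A surviving lane would then call a tile-level collective whose partner has already exited; all threads of a tile must participate in a collective, so this is undefined behavior and can easily look nondeterministic:

__device__ void kernel2_unsafe(cg::thread_block_tile<2> g, int *data) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x, val = data[tid];

    // per-thread condition: one lane of a tile may exit while the
    // other lane survives
    if (val > 10) {
        return;
    }

    // broken: this tile-level collective may now involve an exited
    // partner, which is undefined behavior
    val += g.shfl_down(val, 1);
    data[tid] = val;
}

Your posted code avoids this particular hazard, because the broadcast g.shfl(val, 0) guarantees both lanes of a tile take the same branch.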

Exited threads do not create or cause divergence. The warp scheduler will not replay instructions purely on behalf of exited threads.
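
If you want to see this empirically, here is a minimal sketch (placed where “do something else” begins) that prints the warp’s active-lane mask after the early returns. Note that __activemask() reports the lanes converged at the call site, which here coincides with the surviving lanes because the remaining tiles execute the same code:

    unsigned mask = __activemask();     // bitmask of lanes active here
    int lane = threadIdx.x % 32;
    if (lane == __ffs(mask) - 1) {      // lowest surviving lane reports
        // for your 13-of-16-tiles case this would show 26 lanes
        printf("warp %d active mask 0x%08x (%d lanes)\n",
               (int)(threadIdx.x / 32), mask, __popc(mask));
    }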

Thanks for your reply.

So in my example, the SP that executes this workload will only be 81.25% (26/32) occupied, because the other 6 threads have already exited? I ask because I have come across a few papers discussing thread regrouping or warp re-formation. Does CUDA do any of those optimizations internally?

I have no idea what that means. In CUDA-speak, an SP is a CUDA core, which is basically a floating-point ALU.

I’ve never heard those terms used in connection with CUDA.

I don’t see how any of this relates to your original question.