Cooperative groups whose size is only known at run time

Good afternoon, all.
Following a suggestion Jimmy once made, I have been reading about cooperative groups and thinking about ways to partition and synchronize groups of threads that work on data whose layout is only known at run time, just before I launch the kernel.

For example, if I have a 2D matrix of 1050 rows by 1000 columns and want to do a reduction on each column, I can’t simply use the traditional reduction, since it collapses the entire dataset into a single value. The 1050-row example is intentional: columns with more than 1024 rows are something I expect very often, so I can’t rely on a whole column fitting into a single thread block, which is limited to 1024 threads.
The sum of each column will be accumulated in an output array with one element per input column.

When I use the traditional grid-stride loop:

// cast first so the products are computed in size_t, not 32-bit unsigned
size_t global_idx = (size_t)blockDim.x * blockIdx.x + threadIdx.x,
       offset = (size_t)gridDim.x * blockDim.x;

while (global_idx < LENGTH) {
    // do something
    global_idx += offset;
}

I can determine which column and row the element at global_idx belongs to. This helps with reducing the columns when a given thread block straddles the tail of one column and the head of the next. Designated threads will do the partial sums in shared memory, and only one thread per column in each block will do the atomicAdd to the output array (needed in case part of the column is handled by another thread block).
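To make the indexing concrete, here is a minimal sketch of the mapping I have in mind. It assumes the matrix is stored column by column (each column contiguous in memory), which is what "tail of a column and head of another" implies. The names Coord, coord_of and column_sums are hypothetical, and the kernel is only a naive baseline: it issues one atomicAdd per element and skips the shared-memory stage entirely. The sums array is assumed to be zeroed before the launch.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper type: position of one element inside the matrix.
struct Coord {
    size_t row;
    size_t column;
};

// Maps a flat index to (row, column), ASSUMING column-major storage,
// i.e. element (row, column) lives at column * rows + row.
__host__ __device__ inline Coord coord_of(size_t idx, size_t rows)
{
    return Coord{ idx % rows, idx / rows };
}

// Naive baseline: every element is added to its column sum atomically.
// sums must hold `columns` floats and be zeroed before the launch.
__global__ void column_sums(const float *in, float *sums,
                            size_t rows, size_t columns)
{
    size_t global_idx = (size_t)blockDim.x * blockIdx.x + threadIdx.x;
    size_t offset     = (size_t)gridDim.x * blockDim.x;
    size_t length     = rows * columns;   // only known at run time

    while (global_idx < length) {
        Coord c = coord_of(global_idx, rows);
        atomicAdd(&sums[c.column], in[global_idx]);
        global_idx += offset;
    }
}
```

With rows = 1050, index 1049 is the last element of column 0 and index 1050 is the first element of column 1, so a 1024-thread block starting at index 0 ends inside column 0 and the next block covers both the tail of column 0 and the head of column 1.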

With that background, here is my question:

  • Is it possible to create one or more cooperative groups based on a decision I make by evaluating global_idx at run time? That is, I know neither the number of groups nor the number of threads in each group at compile time.
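For reference, the closest candidate I have found so far is labeled_partition from the cooperative groups API, which splits a group by a label computed at run time. The sketch below is only my reading of it, not a confirmed solution: it assumes CUDA 11 or newer, compute capability 7.0 or higher, and column-major storage. The kernel name column_sums_cg is hypothetical, and sums is assumed to be zeroed before the launch. As far as I understand, the resulting groups never span more than one warp, so combining partial sums across warps and blocks still goes through atomicAdd here.

```cuda
#include <cstddef>
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>

namespace cg = cooperative_groups;

// Sketch: threads of a warp that fall in the same column form one group,
// reduce their values inside the group, and one thread per group updates
// the column sum. ASSUMES column-major storage and a zeroed sums array.
__global__ void column_sums_cg(const float *in, float *sums,
                               size_t rows, size_t columns)
{
    size_t global_idx = (size_t)blockDim.x * blockIdx.x + threadIdx.x;
    size_t offset     = (size_t)gridDim.x * blockDim.x;
    size_t length     = rows * columns;   // only known at run time

    while (global_idx < length) {
        // Threads of this warp that are still inside the loop.
        cg::coalesced_group active = cg::coalesced_threads();

        // Run-time label: the column this thread's element belongs to.
        unsigned int column = (unsigned int)(global_idx / rows);
        cg::coalesced_group same_column =
            cg::labeled_partition(active, column);

        // Sum over the threads that share the label.
        float partial = cg::reduce(same_column, in[global_idx],
                                   cg::plus<float>());

        // One atomic per (warp, column) pair instead of one per element.
        if (same_column.thread_rank() == 0)
            atomicAdd(&sums[column], partial);

        global_idx += offset;
    }
}
```

The number of groups and their sizes are decided per warp and per loop iteration, which is the run-time behaviour I am after, but I don’t know whether this is the intended way to use the API or whether something similar exists at block scope.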