Cooperative Groups: Flexible CUDA Thread Programming

Hi,
is the follow-up post on grid_group already available?

Questions 1 and 2 below may reflect a misunderstanding on my part.

1) Should the comment "Each thread adds its partial sum[i] to sum[lane+i]" instead be something like "Each thread adds the partial sum[lane+i] to its accumulator sum[lane] (only lane 0 will have the full accumulated value)"? (See the sketch after this list.)

2) For sum_kernel_block(), doesn't thread_sum() assume n is divisible by 4, and shouldn't the formula for nBlocks use n/4?

3) Is the optimization of loop unrolling a -O2 feature or a -O3 feature? (My project can only use -O2.)

4) Is the optimization of removing the synchronization statement for warps a -O2 feature or a -O3 feature?

5) After a vector-4 load, if a device function foo() is called as foo(v.x); foo(v.y); foo(v.z); foo(v.w);, will the compiler optimize across the four invocations of foo(), or will each invocation be treated as its own "basic block" for optimization? Please consider both the -O2 and -O3 flags.

6) Please consider updating the grid-stride loop blog with a section on threads working on vectors.
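For context on question 1: the loop being discussed is the shared-memory reduction from the article. A rough paraphrase of that reduce_sum (a sketch from memory, not a verbatim copy):

__device__ int reduce_sum(cg::thread_group g, int *temp, int val)
{
    int lane = g.thread_rank();

    // Each iteration halves the number of active threads;
    // each remaining thread adds temp[lane + i] into its own running val
    for (int i = g.size() / 2; i > 0; i /= 2)
    {
        temp[lane] = val;
        g.sync();                               // wait for all threads to store
        if (lane < i) val += temp[lane + i];
        g.sync();                               // wait for all threads to load
    }
    return val;                                 // only thread 0 returns the full sum
}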


To make this reduction work with an input length that is not divisible by 4, thread_sum() needs to be modified:


__device__ int thread_sum(int *input, int n)
{
    int sum = 0;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Grid-stride loop over the full int4 (vectorized) chunks
    for (int i = idx; i < n / 4; i += blockDim.x * gridDim.x)
    {
        int4 in = ((int4*)input)[i];
        sum += in.x + in.y + in.z + in.w;
    }

    // The first (n % 4) threads of the grid each pick up one tail element
    int i = idx + n / 4 * 4;
    if (i < n)
        sum += input[i];

    return sum;
}
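On the nBlocks half of question 2: since each thread now consumes four elements in the vectorized loop, the launch only needs enough threads to cover n/4 elements. A minimal sketch, assuming the article's sum_kernel_block(int *sum, int *input, int n) signature and n >= 4 (blockSize, nBlocks, sharedBytes and sum are placeholder names):

int blockSize = 256;
int nBlocks = (n / 4 + blockSize - 1) / blockSize;   // cover the n/4 vectorized elements
int sharedBytes = blockSize * sizeof(int);           // temp storage for the block reduction
// The 0-3 leftover elements are handled by the first threads of the grid (see above)
sum_kernel_block<<<nBlocks, blockSize, sharedBytes>>>(sum, input, n);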

To fix 2), see my comment on the main article.

Are the .match_any() and .match_all() methods available on all GPU generations? I know the intrinsics were introduced with Volta and newer architectures, but can I write a function that does a similar job on earlier architectures?
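The hardware match operations require Volta (compute capability 7.0) or newer, but the behaviour can be emulated with warp shuffles on earlier architectures, at the cost of a fixed 32-iteration loop. A minimal sketch (the function name is mine; every lane named in mask must call it with the same mask):

// Sketch of a __match_any_sync()-style helper for pre-Volta GPUs.
// Returns a mask of the lanes in 'mask' whose 'value' equals this lane's value.
__device__ unsigned match_any_emulated(unsigned mask, int value)
{
    unsigned result = 0;
    for (int lane = 0; lane < 32; ++lane) {
        if (!(mask & (1u << lane))) continue;        // lane not participating (uniform branch)
        int other = __shfl_sync(mask, value, lane);  // broadcast that lane's value
        if (other == value) result |= 1u << lane;    // same value: record the lane
    }
    return result;
}

A match_all equivalent then falls out of it: the predicate is simply result == mask.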

Hello, is any follow-up regarding grid sync or device sync available?

I am also looking for a variant of grid synchronisation like the following:

I would like to achieve synchronisation among the active thread blocks scheduled on a GPU.
Is this possible with the current cooperative thread grouping and grid synchronisation concept?

My requirement is that the currently scheduled thread blocks cooperatively load a memory segment into shared memory, then compute, and then synchronize until both phases are complete…
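That is essentially what grid_group::sync() gives you, with the caveat that the kernel must be launched with cudaLaunchCooperativeKernel (and compiled with -rdc=true) so that every block in the grid is guaranteed to be resident; the launch fails if the grid is too large to be co-resident. A minimal sketch of the load-then-compute pattern (kernel and variable names are mine):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Phase 1: each block stages a tile into its shared memory.
// grid.sync() then guarantees every resident block has finished loading
// before any block starts phase 2.
__global__ void load_sync_compute(const int *in, int *out, int n)
{
    extern __shared__ int tile[];
    cg::grid_group grid = cg::this_grid();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];          // phase 1: load

    grid.sync();                                   // all blocks of the grid meet here

    if (i < n) out[i] = tile[threadIdx.x] * 2;     // phase 2: compute (placeholder operation)
}

// Host side (d_in, d_out, n, nBlocks, blockSize assumed set up elsewhere):
void *args[] = { &d_in, &d_out, &n };
cudaLaunchCooperativeKernel((void *)load_sync_compute, nBlocks, blockSize, args,
                            blockSize * sizeof(int), 0);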

I found two new functions in cooperative_groups, namely sync_grids and sync_warp (inside include/cooperative_groups/sync.h). I wanted to know whether there are any open-source or public projects that use these primitives. Can someone point to those?

In the last part of atomicAggInc, shouldn't the returned

int prev;

be an int* prev?

reference code:

__device__
int atomicAggInc(int *ptr)
{
    cg::coalesced_group g = cg::coalesced_threads();
    int prev;

    // elect the first active thread to perform atomic add
    if (g.thread_rank() == 0) {
        prev = atomicAdd(ptr, g.size());
    }

    // broadcast previous value within the warp
    // and add each active thread’s rank to it
    prev = g.thread_rank() + g.shfl(prev, 0);
    return prev;
}

No, it returns the value, not the pointer. The pointer should not change. “prev” probably isn’t a good name for the variable, though.

Oh I see, it’s my fault. I misunderstood atomicAdd: its first parameter should be a pointer, not a value. Thank you for the reply.

By the way, do we need to add a cg::sync() after the if statement? In theory, as far as I can see, we should have a sync there. I wonder whether the compiler will add one for us so that we don't have to write it explicitly?

__device__
int atomicAggInc(int *ptr)
{
    cg::coalesced_group g = cg::coalesced_threads();
    int prev;

    // elect the first active thread to perform atomic add
    if (g.thread_rank() == 0) {
        prev = atomicAdd(ptr, g.size());
    }

    ///////////////// g.sync() here? or is it not needed?

    // broadcast previous value within the warp
    // and add each active thread’s rank to it
    prev = g.thread_rank() + g.shfl(prev, 0);
    return prev;
}

Hi, is this available in CUDA 11? And can anyone tell me what the vec3_to_linear call is used for?
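On the second part: I don't have the updated article in front of me, but judging by the name, vec3_to_linear presumably just flattens a 3-component (x, y, z) index into a single row-major linear offset, roughly along these lines (the signature is my guess):

// Sketch of what a vec3_to_linear helper presumably does: flatten a 3D
// index (e.g. a thread's position within its block) into one linear offset.
__device__ unsigned int vec3_to_linear(dim3 idx, dim3 dims)
{
    return (idx.z * dims.y + idx.y) * dims.x + idx.x;
}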

This looks like it was fixed but seems to have reverted; the article says cg::partition again.