Hi,
Is the follow-up post on grid_group already available?
Questions 1 and 2 below may reflect a misunderstanding on my part.
1) Should the comment "Each thread adds its partial sum[i] to sum[lane+i]" be something like "Each thread adds the partial sum[lane+i] to its accumulator sum[lane] (only lane 0 will have the full accumulated value)"?
2) For sum_kernel_block(), doesn't thread_sum() assume n is divisible by 4, and doesn't the formula for nBlocks require n/4?
3) Is the optimization of loop unrolling a -O2 feature or a -O3 feature? (My project can only use -O2.)
4) Is the optimization of removing the synchronization statement for warps a -O2 feature or a -O3 feature?
5) After a vector-4 load, if a device function foo() is called as foo(v.x); foo(v.y); foo(v.z); foo(v.w);, will the compiler optimize across the four invocations of foo() or will each invocation be treated as a "basic block of optimization"? Please consider both optimization flags of -O2 and -O3.
6) Please consider updating the grid-size loop blog with a section on threads working on vectors.
To make this reduction compatible with input that is not divisible by 4, thread_sum() needs to be modified:
__device__ int thread_sum(int *input, int n)
{
    int sum = 0;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Grid-stride loop over the full int4 chunks
    for (int i = idx; i < n / 4; i += blockDim.x * gridDim.x)
    {
        int4 in = ((int4*)input)[i];
        sum += in.x + in.y + in.z + in.w;
    }

    // The first (n % 4) threads each pick up one leftover element
    int i = idx + n / 4 * 4;
    if (i < n)
        sum += input[i];

    return sum;
}
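With the remainder handled inside thread_sum(), the launch no longer needs n divisible by 4; the grid size can simply round up over the int4 chunk count. A host-side sizing sketch (the sum_kernel_block name is taken from the article; blockSize and the argument list are assumptions):

```cuda
// Sketch: grid sizing that tolerates n not divisible by 4.
int blockSize = 256;
int nChunks = n / 4;                                  // number of full int4 chunks
int nBlocks = (nChunks + blockSize - 1) / blockSize;  // round up
if (nBlocks == 0) nBlocks = 1;                        // n < 4: still launch one block
sum_kernel_block<<<nBlocks, blockSize>>>(sum, input, n);
```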
To fix 2), see my comment on the main article.
Are the .match_any() and .match_all() methods available on all generations? I know the intrinsics arrived with Volta and newer architectures, but can I write a function that does a similar job on earlier architectures?
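On pre-Volta parts there is no single-instruction equivalent, but the effect of __match_any_sync can be emulated with warp-wide shuffles and ballots. A sketch (the function name is mine; it costs O(warpSize) iterations instead of one instruction):

```cuda
// Hedged sketch: emulate __match_any_sync on pre-Volta architectures.
// Each iteration broadcasts one lane's value and ballots over equality;
// a lane keeps the ballot from the iteration where it was the source lane.
__device__ unsigned match_any_emulated(unsigned mask, int value)
{
    unsigned lane = threadIdx.x & 31;
    unsigned peers = 0;
    for (int src = 0; src < 32; ++src) {
        int other = __shfl_sync(mask, value, src);          // lane src's value
        unsigned ballot = __ballot_sync(mask, value == other); // peers of lane src
        if (lane == (unsigned)src)
            peers = ballot;
    }
    return peers;
}
```

__shfl_sync and __ballot_sync only require CUDA 9+, so this runs on Kepler-class and newer hardware.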
Hello, is any follow-up regarding grid sync or device sync available?
I am also looking for a variant of grid synchronisation like the following:
I would like to achieve synchronisation among the active thread blocks scheduled on a GPU.
Is this possible with the current co-operative thread grouping and grid synchronisation concept?
My requirement is that the currently scheduled thread blocks co-operatively load a memory segment into shared memory, then compute, and then synchronize until both phases are complete…
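Synchronizing all resident blocks between a load phase and a compute phase is what a cooperative launch plus grid_group::sync() provides, as long as the grid is sized so that every block can be resident at once. A sketch (kernel and variable names are mine):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Hedged sketch: two-phase kernel with a grid-wide barrier.
// Requires a cooperative launch and a grid small enough that
// all blocks are resident on the device simultaneously.
__global__ void two_phase_kernel(const float *in, float *out, int n)
{
    cg::grid_group grid = cg::this_grid();

    // Phase 1: each block loads/stages its memory segment
    // ... load into shared memory, compute partial results ...

    grid.sync();  // every block in the grid reaches this point first

    // Phase 2: consume results produced by other blocks
    // ...
}

// Host side (sketch): size the grid from occupancy, then launch cooperatively.
// int numBlocksPerSm = 0, threads = 256;
// cudaDeviceProp prop; cudaGetDeviceProperties(&prop, 0);
// cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSm,
//     two_phase_kernel, threads, 0);
// dim3 grid(prop.multiProcessorCount * numBlocksPerSm), block(threads);
// void *args[] = { &in, &out, &n };
// cudaLaunchCooperativeKernel((void*)two_phase_kernel, grid, block, args);
```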
I found two new functions in cooperative_groups, namely sync_grids and sync_warp (inside include/cooperative_groups/sync.h). I wanted to know if there are any open-source or public projects that use these primitives. Can someone point me to those?
In the last part of atomicAggInc, shouldn't the returned
int prev;
be an int* prev?
reference code:
__device__
int atomicAggInc(int *ptr)
{
    cg::coalesced_group g = cg::coalesced_threads();
    int prev;
    // elect the first active thread to perform atomic add
    if (g.thread_rank() == 0) {
        prev = atomicAdd(ptr, g.size());
    }
    // broadcast previous value within the warp
    // and add each active thread's rank to it
    prev = g.thread_rank() + g.shfl(prev, 0);
    return prev;
}
No, it returns the value, not the pointer. The pointer should not change. “prev” probably isn’t a good name for the variable, though.
Oh I see, it's my fault. I misunderstood atomicAdd: its first parameter should be a pointer, not a value. Thank you for the reply.
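For context, the typical use of atomicAggInc is warp-aggregated stream compaction, where each passing thread reserves a unique output slot while the coalesced group issues only one atomic per warp. A sketch (the predicate here is just an example):

```cuda
// Sketch: warp-aggregated filtering using atomicAggInc.
// Threads whose element passes the predicate form a coalesced group;
// the group performs one atomicAdd instead of one atomic per thread.
__global__ void filter_k(int *dst, int *nres, const int *src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n || src[i] <= 0)         // example predicate: keep positives
        return;
    dst[atomicAggInc(nres)] = src[i];  // each passing thread gets a unique slot
}
```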
By the way, do we need to add a cg::sync() after the if statement? In theory, as far as I can see, we should sync there before the broadcast. Or will the compiler add a sync for us, so we don't have to write it explicitly?
__device__
int atomicAggInc(int *ptr)
{
    cg::coalesced_group g = cg::coalesced_threads();
    int prev;
    // elect the first active thread to perform atomic add
    if (g.thread_rank() == 0) {
        prev = atomicAdd(ptr, g.size());
    }
    ///////////////// g.sync() here? or not have to?
    // broadcast previous value within the warp
    // and add each active thread's rank to it
    prev = g.thread_rank() + g.shfl(prev, 0);
    return prev;
}
Hi, is it available on CUDA 11? And can anyone tell me what the vec3_to_linear call is used for?
This looks like it was fixed but seems to have reverted; the article says cg::partition again.