__syncthreads_or(int predicate) is identical to __syncthreads() with the additional feature that it evaluates predicate for all threads of the block and returns non-zero if and only if predicate evaluates to non-zero for any of them.
Seems like it would work, but there’s a faster solution:
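__global__ void computeEliminatedArray(double *d_fr_iteration, int nDim, int *eliminated) {
    __shared__ int flag;    // one flag per block, visible to all of its threads

    flag = 0;               // every thread writes the same value, so this race is benign
    __syncthreads();        // the reset must finish before any thread sets the flag

    if (myfunction(...) == 1)
        flag = 1;
    __syncthreads();        // wait until every thread has had its chance to set the flag

    if (threadIdx.x == 0)
        eliminated[blockIdx.x] = flag;
}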
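For comparison, here is a rough sketch of how the same per-block flag might be computed with __syncthreads_or() (compute capability 2.0 or higher), so that no shared variable is needed. Everything beyond the original kernel signature is an assumption made for illustration: isEliminated() is a hypothetical stand-in for myfunction(...), the data is assumed to be laid out as one row of nDim values per block, and runEliminated() is just a host-side helper for trying it out.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for myfunction(...): flags negative values.
__device__ int isEliminated(double x) { return x < 0.0 ? 1 : 0; }

// Assumed layout: block b inspects d_fr_iteration[b*nDim .. b*nDim + nDim - 1].
__global__ void computeEliminatedArrayOr(const double *d_fr_iteration, int nDim, int *eliminated)
{
    int hit = 0;  // per-thread result, kept in a register

    // Strided loop, so the block size does not have to match nDim.
    for (int i = threadIdx.x; i < nDim; i += blockDim.x)
        hit |= isEliminated(d_fr_iteration[blockIdx.x * nDim + i]);

    // Barrier plus block-wide OR. Every thread of the block must reach this call.
    int any = __syncthreads_or(hit);

    if (threadIdx.x == 0)
        eliminated[blockIdx.x] = any ? 1 : 0;
}

// Host-side helper (illustrative only, error checking omitted).
std::vector<int> runEliminated(const std::vector<double> &data, int nRows, int nDim)
{
    double *d_data = nullptr;
    int *d_out = nullptr;
    cudaMalloc(&d_data, data.size() * sizeof(double));
    cudaMalloc(&d_out, nRows * sizeof(int));
    cudaMemcpy(d_data, data.data(), data.size() * sizeof(double), cudaMemcpyHostToDevice);

    computeEliminatedArrayOr<<<nRows, 64>>>(d_data, nDim, d_out);

    std::vector<int> out(nRows);
    cudaMemcpy(out.data(), d_out, nRows * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_out);
    return out;
}
```

The intrinsic returns some non-zero value rather than exactly 1, which is why the result is normalized before it is stored.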
Replacing the plain ++ increment with atomicAdd(&d_count, 1); would make it work, but it would be very slow, because every thread would contend for the same memory location. See the reduction example from the SDK for how to make it faster - the basic idea is to have each thread increment a register (which is a fast operation), and then use the slow operations only at the end, to add up the per-thread sums.
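A minimal sketch of that idea, not the SDK code itself: each thread counts into a register, the block combines its counts with a shared-memory tree reduction, and only one atomicAdd per block touches the global counter. The names shouldCount(), countMatches() and countOnGpu() are hypothetical, the block size is assumed to be a fixed power of two, and d_count is passed here as a pointer to device memory rather than declared as a __device__ variable.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // assumed power-of-two block size

// Hypothetical condition that decides whether an element is counted.
__device__ bool shouldCount(int x) { return x > 0; }

__global__ void countMatches(const int *data, int n, unsigned int *d_count)
{
    __shared__ unsigned int partial[BLOCK_SIZE];

    // Step 1: count in a register using a grid-stride loop (fast).
    unsigned int local = 0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x)
        if (shouldCount(data[i]))
            ++local;

    // Step 2: tree reduction of the per-thread counts in shared memory.
    partial[threadIdx.x] = local;
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    // Step 3: a single atomic per block instead of one per element.
    if (threadIdx.x == 0)
        atomicAdd(d_count, partial[0]);
}

// Host-side helper (illustrative only, error checking omitted).
unsigned int countOnGpu(const std::vector<int> &data)
{
    int n = static_cast<int>(data.size());
    int *d_data = nullptr;
    unsigned int *d_count = nullptr;
    cudaMalloc(&d_data, (n > 0 ? n : 1) * sizeof(int));
    cudaMalloc(&d_count, sizeof(unsigned int));
    cudaMemcpy(d_data, data.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_count, 0, sizeof(unsigned int));

    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    if (blocks < 1) blocks = 1;
    if (blocks > 64) blocks = 64;  // the grid-stride loop covers the rest
    countMatches<<<blocks, BLOCK_SIZE>>>(d_data, n, d_count);

    unsigned int result = 0;
    cudaMemcpy(&result, d_count, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_count);
    return result;
}
```

With this layout the number of atomic operations equals the number of blocks, not the number of elements, so contention on the counter stays small.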