add elements of array has any body implemented

hi all,

has any body implemented the addition of elements of entire array example in programing guide 2.2 page 117 …

B.5 Memory Fence Functions
void __threadfence();
CUDA Programming Guide Version 2.2
C Extensions
Appendix B.
waits until all global and shared memory accesses made by the calling thread prior to
__threadfence() are visible to all threads in the device.
void __threadfence_block();
waits until all global and shared memory accesses made by the calling thread prior to
__threadfence_block() are visible to all threads in the thread block.
In general, when a thread issues a series of writes to memory in a particular order,
other threads may see the effects of these memory writes in a different order.
__threadfence() and __threadfence_block() can be used to enforce
some ordering.
One use case is when threads consume some data produced by other threads as
illustrated by the following code sample of a kernel that computes the sum of an
array of N numbers in one call. Each block first sums a subset of the array and
stores the result in global memory. When all blocks are done, the last block done
reads each of these partial sums from global memory and sums them to obtain the
final result. In order to determine which block is finished last, each block atomically
increments a counter to signal that it is done with computing and storing its partial
sum (see Section B.10 about atomic functions). The last block is the one that
receives the counter value equal to gridDim.x-1. If no fence is placed between
storing the partial sum and incrementing the counter, the counter might increment
before the partial sum is stored and therefore, might reach gridDim.x-1 and let
the last block start reading partial sums before they have been actually updated in
device unsigned int count = 0;
shared bool isLastBlockDone;
global void sum(const float* array, unsigned int N,
float* result)
// Each block sums a subset of the input array
float partialSum = calculatePartialSum(array, N);
if (threadIdx.x == 0) {
// Thread 0 of each block stores the partial sum
// to global memory
result[blockIdx.x] = partialSum;
// Thread 0 makes sure its result is visible to
// all other threads
// Thread 0 of each block signals that it is done
unsigned int value = atomicInc(&count, gridDim.x);
// Thread 0 of each block determines if its block is
// the last block to be done
isLastBlockDone = (value == (gridDim.x - 1));
// Synchronize to make sure that each thread reads
// the correct value of isLastBlockDone
CUDA Programming Guide Version 2.2 109
C Extensions
Appendix B.
if (isLastBlockDone) {
// The last block sums the partial sums
// stored in result[0 … gridDim.x-1]
float totalSum = calculateTotalSum(result);
if (threadIdx.x == 0) {
// Thread 0 of last block stores total sum
// to global memory and resets count so that
// next kernel call works properly
result[0] = totalSum;
count = 0;

hello biebo,

you need to take a look at the sdk example folder called “reduced”
it provides code & documentation (in the form of ppt slides) and first starts w/ a naive implementation then goes through 7 different optimizations and the final implementation is 30x faster than the first