A question about the calculatePartialSum sample code in the CUDA C Programming Guide

This type of technique is sometimes referred to as a threadFence reduction or a block-draining reduction. A corresponding (complete) sample code is here.

The general intra-block reduction strategy (what calculatePartialSum() does) need not be directly connected to the threadFence method, which is why the programming guide does not provide a fully fleshed-out example. You can quickly learn how to write that code yourself from canonical material such as here.

However, if you prefer, we could connect the two examples by saying that calculatePartialSum() is approximately equivalent to the reduceBlock() code here. To increase the similarity, we could also modify the reduceBlock() prototype to return a float, as follows:

__device__ float reduceBlock(volatile float *sdata, float mySum,
                        const unsigned int tid, cg::thread_block cta) {

and at the end of that function we would include the following statement:

return sdata[0];
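
To make the connection concrete, here is a minimal sketch of how the two pieces could fit together. It is not the code from the sample or from the programming guide: blockSum() is a hypothetical stand-in for calculatePartialSum() that uses a plain shared-memory tree reduction with __syncthreads() at every step (so sdata does not need to be volatile here), and the names kThreads, retirementCount, sumKernel, and sumOnDevice are likewise made up for illustration. It assumes the block size is a power of two equal to kThreads, and it omits error checking for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr unsigned int kThreads = 256;        // assumed block size (power of two)

__device__ unsigned int retirementCount = 0;  // number of blocks that have finished

// Hypothetical stand-in for calculatePartialSum(): a shared-memory tree
// reduction. Every thread passes in its private sum; every thread gets the
// block-wide total back via the final "return sdata[0];".
__device__ float blockSum(float *sdata, float mySum, unsigned int tid) {
    sdata[tid] = mySum;
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    return sdata[0];
}

__global__ void sumKernel(const float *in, volatile float *partial,
                          float *out, unsigned int n) {
    __shared__ float sdata[kThreads];
    __shared__ bool isLastBlock;
    const unsigned int tid = threadIdx.x;

    // Phase 1: grid-stride accumulation, then an intra-block reduction.
    float mySum = 0.0f;
    for (unsigned int i = blockIdx.x * blockDim.x + tid; i < n;
         i += gridDim.x * blockDim.x)
        mySum += in[i];
    const float blockTotal = blockSum(sdata, mySum, tid);

    if (tid == 0) {
        partial[blockIdx.x] = blockTotal;
        __threadfence();  // make the partial sum visible before taking a ticket
        const unsigned int ticket = atomicInc(&retirementCount, gridDim.x);
        isLastBlock = (ticket == gridDim.x - 1);
    }
    __syncthreads();      // broadcast isLastBlock; also guards reuse of sdata

    // Phase 2: only the last block to finish adds up all the partial sums.
    if (isLastBlock) {
        float s = 0.0f;
        for (unsigned int i = tid; i < gridDim.x; i += blockDim.x)
            s += partial[i];
        const float total = blockSum(sdata, s, tid);
        if (tid == 0) {
            *out = total;
            retirementCount = 0;  // reset so the kernel can be launched again
        }
    }
}

// Host helper (hypothetical): copies the data over, launches one kernel,
// and returns the total.
float sumOnDevice(const std::vector<float> &h, unsigned int blocks) {
    const unsigned int n = static_cast<unsigned int>(h.size());
    float *dIn = nullptr, *dPartial = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dPartial, blocks * sizeof(float));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dIn, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    sumKernel<<<blocks, kThreads>>>(dIn, dPartial, dOut, n);
    float result = 0.0f;
    cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dPartial);
    cudaFree(dOut);
    return result;
}

int main() {
    std::vector<float> h(1 << 20, 1.0f);  // all ones, so the sum is the element count
    printf("sum = %.1f\n", sumOnDevice(h, 64));
    return 0;
}
```

The point of the sketch is that the intra-block reduction appears twice, once per block for the partial sums and once in the last block for the final total, and neither call knows anything about the fence. The threadFence part is only the few lines under if (tid == 0): write the partial sum, fence, take a ticket with atomicInc, and let the block that draws the last ticket finish the job.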