There is no robust way to do inter-block synchronization in the CUDA programming model, as blocks could even execute serially under that model, leading to deadlocks as described by seibert above.
The robust way to achieve the desired functionality is to launch two kernels, one for each stage of the two-stage reduction. While the second kernel often runs with very low efficiency in such a setup, it also tends to run very briefly, so that overall efficiency of the reduction is completely dominated by the more expensive first stage of the reduction.